We provide here a framework to analyze the phase transition phenomenon
of sliced inverse regression (SIR), a supervised dimension reduction
technique introduced by Li [1991]. Under mild conditions, the asymptotic
ratio $\rho = \lim p/n$ is the phase transition parameter and the SIR
estimator is consistent if and only if $\rho = 0$. When dimension $p$ is
greater than $n$, we propose a diagonal thresholding screening SIR (DT-SIR)
algorithm. This method provides us with an estimate of the eigen-space of
the covariance matrix of the conditional expectation
$\mathrm{var}(\mathbb{E}[x|y])$. The desired dimension reduction space is
then obtained by multiplying the inverse of the covariance matrix on the
eigen-space. Under certain sparsity assumptions on both the covariance
matrix of predictors and the loadings of the directions, we prove the
consistency of DT-SIR in estimating the dimension reduction space in high
dimensional data analysis. Extensive numerical experiments demonstrate
superior performances of the proposed method in comparison to its
competitors.


1. Introduction. For a continuous multivariate random variable $(y, x)$
where $x \in \mathbb{R}^p$ and $y \in \mathbb{R}$, a subspace
$\mathcal{S}' \subset \mathbb{R}^p$ is called the effective dimension
reduction (EDR) space if $y \perp\!\!\!\perp x \mid P_{\mathcal{S}'} x$,
where $\perp\!\!\!\perp$ stands for independence. Under mild conditions
(Cook [1996]), the intersection of all the EDR spaces is again an EDR space,
which is denoted as $\mathcal{S}$ and called the central space. Many
algorithms were proposed to find such a subspace $\mathcal{S}$ under the
assumption $d = \dim(\mathcal{S}) \ll p$. This line of research is commonly
known as sufficient dimension reduction (SDR). The Sliced Inverse Regression
(SIR, Li [1991]) is the first, yet the most widely used, method in
sufficient dimension reduction, due to its simplicity, computational
efficiency and generality. The asymptotic properties of SIR have been of
particular interest over the last two decades. The consistency of SIR has
been proved for fixed $p$ in Li [1991], Hsing and Carroll [1992], Zhu and Ng
[1995] and Zhu and Fang [1996]. Later, Zhu et al. [2006] obtained the
consistency if $p = o(\sqrt{n})$. A similar restriction also appears in two
recent works (see Zhong et al. [2012] and Jiang and Liu [2014]). When
$p > n$, a common strategy pursued by many recent researchers is to make
sparsity assumptions that only a few predictors play a role in explaining
and predicting $y$ and to apply various regularization methods. For
instance, Li and Nachtsheim [2006], Li [2007] and Yu et al. [2013] applied
LASSO (Tibshirani [1996]), Dantzig selector (Candes and Tao [2007]) and
elastic net (Zou and Hastie [2005]) respectively to solve the generalized
eigenvalue problems raised by a variety of SDR algorithms.

However, a piece of the jigsaw is missing in the understanding of SIR. If
the dimension $p$ diverges as $n$ increases, when will SIR break down? A
similar question has been asked for a variety of SDR estimates in Cook et
al. [2012]. In this paper, we prove that, under certain technical
assumptions, the SIR estimator is consistent if and only if
$\rho = \lim p/n = 0$. Such a result on inconsistency provides theoretical
justification for imposing certain structural assumptions, such as sparsity,
in high dimensional settings. This behavior of SIR in high dimension, which
will be called the phase transition phenomenon, is similar to that of
principal component analysis (PCA), an unsupervised counterpart of SIR. This
extension is, however, by no means trivial. After all the samples
$(y_i, x_i)$ are sliced into $H$ bins according to the order statistics of
$y_i$, the sliced samples are neither independent nor identically
distributed. This difference increases the difficulty significantly. In this
paper, we provide a new framework to study the phase transition behaviour of
SIR. The technical tools developed here can potentially be extended to study
the phase transition behaviour of other SDR estimators.

The second part of the article aims at extending the original SIR to the
scenario with ultra-high dimension ($p = o(\exp(n^{\xi}))$). Based on
equation (3) in Section 2, the central space can be estimated by the column
space of $\hat{\Sigma}_x^{-1} col(\hat{V}_H)$, where $\hat{\Sigma}_x^{-1}$
is any consistent estimate of the precision matrix $\Sigma_x^{-1}$ and
$col(\hat{V}_H)$ is the estimate of the space
$col(\mathrm{var}(\mathbb{E}[x|y]))$. To estimate the column space of
$\mathrm{var}(\mathbb{E}[x|y])$, we propose a diagonal screening procedure
based on new univariate statistics $\mathrm{var}_H(x(k))$, which are the
sample versions of the diagonal elements of $\mathrm{var}(\mathbb{E}[x|y])$,
motivated by recent work in sparse PCA (Johnstone and Lu [2004]). After
ranking the predictors according to the magnitude of $\mathrm{var}_H(x(k))$
decreasingly, we choose the set $\mathcal{I}$ consisting of the first $R$
predictors as active predictors. The SIR procedure is subsequently applied
to these selected predictors to estimate the $d$-dimensional column space of
$\mathrm{var}(\mathbb{E}[x^{\mathcal{I}}|y])$ by
$col(\hat{V}_H^{\mathcal{I}})$, where $\hat{V}_H^{\mathcal{I}}$ is the matrix
formed by the top $d$ eigenvectors of $\hat{\Lambda}_H^{\mathcal{I}}$. We
embed $\hat{V}_H^{\mathcal{I}}$ into $\mathbb{R}^{p \times d}$ by filling in
0's for entries outside the chosen row set $\mathcal{I}$, and denote this new
matrix by $e(\hat{V}_H^{\mathcal{I}})$. The estimate of the central space is
defined to be $col(\hat{\Sigma}_x^{-1} e(\hat{V}_H^{\mathcal{I}}))$. We name
this two-stage algorithm Diagonal Thresholding SIR (DT-SIR), and prove that
DT-SIR is consistent in estimating the central space under certain
regularity conditions. Extensive simulation studies show that DT-SIR
performs better than its competitors and is computationally efficient.

The rest of the paper is organized as follows. In Section 2, we briefly
describe the SIR procedure and introduce the notations. In Section 3, after
a brief review of existing asymptotic results of the SIR procedure, we state
Theorems 2 and 3 to discuss the phase transition phenomenon of SIR. In
Section 4, we propose the DT-SIR method and show that DT-SIR is consistent
in high dimensional data analysis. In Section 5, we provide simulation
studies to compare DT-SIR with its competitors. Concluding remarks and
discussions are put in Section 6. All the proofs are presented in
appendices.

2. Preliminaries and notations. 

2.1. Sliced inverse regression. Consider the multiple index model

(1)  $y = f(\beta_1^\top x, \ldots, \beta_d^\top x, \epsilon)$

where $x \in \mathbb{R}^p$, $\epsilon$ is the noise and $f$ is an unknown
link function. Without loss of generality, we assume that
$\mathbb{E}[x] = 0 \in \mathbb{R}^p$. Although the $p \times d$ matrix
$V = (\beta_1, \ldots, \beta_d)$ is not identifiable, the space spanned by
the $\beta$'s, which is called the column space of $V$ and denoted by
$col(V)$, might be identified. Li [1991] proposed the Sliced Inverse
Regression (SIR) procedure to estimate the central space $col(V)$ without
knowing $f(\cdot)$, which can be briefly summarized as follows: Given $n$
i.i.d. samples $(y_i, x_i)$, $i = 1, \ldots, n$, SIR first divides them into
$H$ equal-sized slices according to the order statistics $y_{(i)}$.^1 We
re-express the data as $y_{h,j}$ and $x_{h,j}$, where $(h, j)$ is the double
subscript in which $h$ refers to the slice number and $j$ refers to the
order number of a sample in the $h$-th slice, i.e.,

$y_{h,j} = y_{(c(h-1)+j)}, \quad x_{h,j} = x_{(c(h-1)+j)}.$

Here $x_{(k)}$ is the concomitant of $y_{(k)}$. Let the sample mean in the
$h$-th slice be $\bar{x}_{h,\cdot}$, and let the mean of all the samples be
$\bar{x}$. Then, $\Lambda_p = \mathrm{var}(\mathbb{E}[x|y])$ can be estimated
by:

(2)  $\hat{\Lambda}_H = \frac{1}{H} \sum_{h=1}^{H} \bar{x}_{h,\cdot} \bar{x}_{h,\cdot}^\top.$

^1 To ease notations and arguments, we assume that $n = cH$ and
$H = o(\log(n) \wedge \log(p))$ throughout the article.

Based on the observation that

(3)  $col(\Lambda_p) = \Sigma_x col(V),$

SIR then estimates the central space $col(V)$ by
$\hat{\Sigma}_x^{-1} col(\hat{V}_H)$, where $\hat{V}_H$ is the matrix formed
by the top $d$ eigenvectors of $\hat{\Lambda}_H$. Throughout the article, we
assume that $d$ is fixed and the $d$-th largest eigenvalue of $\Lambda_p$ is
bounded away from 0 when $n, p \to \infty$.
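
The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sir_lambda` and `sir_directions` are hypothetical, the sample covariance stands in for $\hat{\Sigma}_x$ (reasonable only when $p$ is small relative to $n$), and the largest-$y$ samples beyond $cH$ are dropped when $n$ is not a multiple of $H$.

```python
import numpy as np

def sir_lambda(x, y, H):
    """Slice-based estimate of var(E[x|y]), following equation (2).

    x is an (n, p) array of centered predictors, y an (n,) array and
    H the number of slices. Hypothetical helper, not from the paper.
    """
    n, p = x.shape
    c = n // H                                # slice size, assuming n = cH
    order = np.argsort(y)                     # order statistics of y
    x_sorted = x[order][: c * H]              # concomitants, slice by slice
    slice_means = x_sorted.reshape(H, c, p).mean(axis=1)   # shape (H, p)
    return slice_means.T @ slice_means / H    # (1/H) sum_h xbar_h xbar_h^T

def sir_directions(x, y, H, d):
    """Return a (p, d) basis estimate of col(V) via equation (3)."""
    x = x - x.mean(axis=0)                    # enforce E[x] = 0 empirically
    lam = sir_lambda(x, y, H)
    _, eigvecs = np.linalg.eigh(lam)          # eigenvalues in ascending order
    v_hat = eigvecs[:, ::-1][:, :d]           # top d eigenvectors
    sigma_hat = x.T @ x / x.shape[0]          # sample covariance (assumption)
    return np.linalg.solve(sigma_hat, v_hat)  # Sigma^{-1} V_H
```

Solving the linear system instead of forming the inverse explicitly is a numerical convenience; the estimated space is the same.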

In order for SIR to result in a consistent estimate of the central space, Li
[1991] imposed the following two conditions:

• (A1). Linearity condition: For any $\xi \in \mathbb{R}^p$,
$\mathbb{E}[\xi^\top x \mid \beta_1^\top x, \ldots, \beta_d^\top x]$ is a
linear combination of $\beta_1^\top x, \ldots, \beta_d^\top x$.

• (A2). Coverage condition: The dimension of the space spanned by the
central curve equals the dimension of the central space, i.e., $d' = d$.

2.2. Further Notations. Let $S_h$ be the $h$-th interval
$(y_{h-1,c}, y_{h,c}]$ for $2 \le h \le H - 1$, $S_1 = (-\infty, y_{1,c}]$
and $S_H = (y_{H-1,c}, \infty)$. Note that these intervals depend on the
order statistics of $y$ and are thus random. For any $\omega$ in the product
sample space, we define a random variable
$\delta_h = \delta_h(\omega) = \int_{y \in S_h(\omega)} f(y)\,dy$, where
$f(y)$ is the density function of $y$. For
$\mathcal{I} \subset \{1, \ldots, n\}$,
$\mathcal{J} \subset \{1, \ldots, p\}$ and an $n \times p$ matrix $A$,
$A^{\mathcal{I},\mathcal{J}}$ denotes the
$|\mathcal{I}| \times |\mathcal{J}|$ sub-matrix formed by restricting the
rows of $A$ to $\mathcal{I}$ and columns to $\mathcal{J}$. In particular,
$A^{-,\mathcal{J}}$ denotes the sub-matrix formed by restricting the columns
to $\mathcal{J}$. For a matrix
$B = A^{\mathcal{I},\mathcal{J}} \in \mathbb{R}^{|\mathcal{I}| \times |\mathcal{J}|}$,
we embed it into $\mathbb{R}^{n \times p}$ by putting 0 on entries outside
$\mathcal{I} \times \mathcal{J}$ and denote the new matrix as $e(B)$.
Similar notations apply to vectors. For two positive numbers $a$ and $b$, we
let $a \vee b = \max\{a, b\}$ and let $a \wedge b = \min\{a, b\}$. Let
$T(x, t) = x \cdot 1(|x| > t)$ be the hard thresholding function. Throughout
the article, $C$, $C_1$ and $C_2$ are used to denote generic absolute
constants, though the actual value may vary from case to case. For a vector
$x$, we denote its $k$-th entry as $x(k)$. Let $\beta_1$ and $\beta_2$ be
two vectors with the same dimension; the angle between these two vectors is
denoted as $\angle(\beta_1, \beta_2)$. For two sequences $\{a_n\}$,
$\{b_n\}$, we let $a_n \ll b_n$ stand for $a_n = O(b_n^{\epsilon})$ for some
positive $\epsilon < 1$ and let $a_n \prec b_n$ stand for
$\lim a_n / b_n = 0$.
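
The hard thresholding function $T(x, t)$ and the embedding $e(B)$ are simple to state in code. The sketch below is only an illustration of the two definitions; the function names `hard_threshold` and `embed` and the argument layout are hypothetical.

```python
import numpy as np

def hard_threshold(x, t):
    """T(x, t) = x * 1(|x| > t), applied entrywise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > t, x, 0.0)

def embed(b, rows, cols, shape):
    """e(B): place B on the index set rows x cols of a zero matrix."""
    out = np.zeros(shape)
    out[np.ix_(rows, cols)] = b   # entries outside rows x cols stay 0
    return out
```

Note that the inequality in $T$ is strict, so an entry exactly equal to $t$ in absolute value is set to zero.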

3. Consistency of SIR. In order to control the behavior of SIR, we need to
impose the following boundedness condition (A3) on the predictors'
covariance matrix in addition to the tail condition (sub-Gaussian) on their
joint distribution. We also need a condition (A4) for the central curve.

• (A3) Boundedness Condition: $x$ is sub-Gaussian; and there exist positive
constants $C_1, C_2$ such that

$C_1 \le \lambda_{\min}(\Sigma_x) \le \lambda_{\max}(\Sigma_x) \le C_2$

where $\lambda_{\min}(\Sigma_x)$ and $\lambda_{\max}(\Sigma_x)$ are the
minimal and maximal eigenvalues of $\Sigma_x$ respectively.

• (A4) The central curve $m(y) = \mathbb{E}[x|y]$ has finite fourth moment
and is $\vartheta$-sliced stable (defined below) with respect to $y$ and
$m(y)$.


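Definition 1. For two positive constants $\gamma_1 < 1 < \gamma_2$, let
$\mathcal{A}_H(\gamma_1, \gamma_2)$ be the collection of all the partitions
$-\infty = a_0 < a_1 < \cdots < a_{H-1} < a_H = \infty$ of $\mathbb{R}$
satisfying that

$\frac{\gamma_1}{H} \le P(a_i \le y \le a_{i+1}) \le \frac{\gamma_2}{H}.$

The central curve $m(y) = \mathbb{E}[x|y]$ is called $\vartheta$-sliced
stable with respect to $y$ for some $\vartheta > 0$ if there exist positive
constants $\gamma_i$, $i = 1, 2, 3$, such that for any $\beta$ in the central
space and any partition in $\mathcal{A}_H(\gamma_1, \gamma_2)$, we have

(4)  $\frac{1}{H} \sum_{h=0}^{H-1} \mathrm{var}\left(\beta^\top m(y) \mid a_h \le y \le a_{h+1}\right) \le \frac{\gamma_3}{H^{\vartheta}} \mathrm{var}\left(\beta^\top m(y)\right).$

The central curve is sliced stable if it is $\vartheta$-sliced stable for
some positive constant $\vartheta$.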


Remark 1. Note that we only need (4) to hold for all unit vectors in the
central space by rescaling. By considering the orthogonal decomposition of
$\beta$ in a general space with respect to the central space and its
complement, it is easy to see that the sliced stability implies that (4)
holds true for all vectors $\beta$. In particular, we have the following two
useful consequences of the sliced stability.

i) By choosing $\beta^\top = (0, \ldots, 0, 1, 0, \ldots, 0)$ with 1 at the
$k$-th position, we have

$\frac{1}{H} \sum_{h=0}^{H-1} \mathrm{var}\left(m(y, k) \mid a_h \le y \le a_{h+1}\right) \le \gamma_3 H^{-\vartheta} \mathrm{var}(m(y, k)),$

where $m(y, k)$ is the $k$-th coordinate of the central curve $m(y)$.

ii) Since equation (4) holds for all unit vectors $\beta$, we have

$\left\| \frac{1}{H} \sum_{h=0}^{H-1} \mathrm{var}\left(m(y) \mid a_h \le y \le a_{h+1}\right) \right\|_2 \le \gamma_3 H^{-\vartheta} \left\| \mathrm{var}(m(y)) \right\|_2.$

Remark 2. Suppose $\mathbb{E}[m(y)] = 0$ and there are $n$ samples
$m_i = m(y_i)$. Let $m_{h,i}$ and $\bar{m}_{h,\cdot}$ be defined similarly
to $x_{h,i}$ and $\bar{x}_{h,\cdot}$, respectively. On one hand, we have the
classic consistent estimator $\frac{1}{n} \sum_i m_i m_i^\top$ of
$\mathrm{var}(m(y))$.

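On the other hand, if we expect that the slice-based estimate
$\frac{1}{H} \sum_h \bar{m}_{h,\cdot} \bar{m}_{h,\cdot}^\top$ of
$\mathrm{var}(m(y))$ is consistent, we must require that the average loss of
variance in each slice decreases to zero as $H$ increases, i.e.,

(5)  $\frac{1}{n} \sum_h \sum_i m_{h,i} m_{h,i}^\top - \frac{1}{H} \sum_h \bar{m}_{h,\cdot} \bar{m}_{h,\cdot}^\top = \frac{1}{n} \sum_{h,i} (m_{h,i} - \bar{m}_{h,\cdot})(m_{h,i} - \bar{m}_{h,\cdot})^\top \to 0.$
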

In Definition 1, we simply choose the decreasing rate to be a power of $H$.
It is easily seen that if $m$ is smooth and $y$ is compactly supported then
(5) holds automatically. In this sense, for a general curve $m$ and random
variable $y$, the sliced stability is a condition on the smoothness of the
central curve $m$ and the tail distribution of $m(y)$. This is not surprising
at all, since most work on the consistency of the SIR estimate requires some
kind of smoothness for the central curve and a tail distribution control for
$m(y)$.
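
A small numerical check can make the "average loss of variance" concrete. The sketch below is a toy illustration under assumptions of our own choosing: a scalar curve $m(y) = y$ with $y$ uniform on $[0, 1]$, equal-count slices, and a hypothetical helper name `avg_within_slice_var`. For this choice each slice is roughly uniform on an interval of width $1/H$, so the ratio of the average within-slice variance to the total variance should be close to $H^{-2}$, which is the kind of power decay in $H$ that Definition 1 asks for.

```python
import numpy as np

def avg_within_slice_var(m_values, y, H):
    """Average over H equal-count slices of the within-slice variance of m(y)."""
    c = len(y) // H
    order = np.argsort(y)                       # slice by the order statistics of y
    m_sorted = np.asarray(m_values)[order][: c * H].reshape(H, c)
    return m_sorted.var(axis=1).mean()

rng = np.random.default_rng(0)
y = rng.uniform(size=20000)
m = y                                           # toy central curve m(y) = y
ratios = {H: avg_within_slice_var(m, y, H) / m.var() for H in (5, 10, 20)}
```

The dictionary `ratios` maps each slice number to the relative loss of variance, which shrinks as $H$ grows.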


The most popular smoothness and tail condition might be the one proposed by
Hsing and Carroll [1992] (later used in Zhu et al. [2006], Zhu and Ng
[1995]) in their proof of the consistency of SIR, which is explained below.
For $B > 0$ and $n \ge 1$, let $\Pi_n(B)$ be the collection of all the
$n$-point partitions $-B \le y_1 \le \cdots \le y_n \le B$ of $[-B, B]$.
First, they assumed that the central curve $m(y)$ satisfies the following
smoothness condition

$\lim_{n \to \infty} \sup_{y \in \Pi_n(B)} n^{-1/4} \sum_{i=2}^{n} \| m(y_i) - m(y_{i-1}) \|_2 = 0, \quad \forall B > 0.$

Second, they assumed that for $B_0 > 0$, there exists a non-decreasing
function $\tilde{m}(y)$ on $(B_0, \infty)$, such that

(6)  $\tilde{m}^4(y) P(|y| > y) \to 0$ as $y \to \infty$,

$\| m(y) - m(y') \|_2 \le | \tilde{m}(y) - \tilde{m}(y') |$ for $y, y' \in (-\infty, -B_0) \cup (B_0, \infty)$.

By changing the tail condition (6) to a slightly stronger condition
$\mathbb{E}[\tilde{m}(y)^4] < \infty$, Neykov et al. [2015] proved that the
modified condition implies the sliced stability condition. Now, we are ready
to state our main results.


Theorem 1. Under conditions (A1), (A2), (A3) and (A4), we have

(7)  $\| \hat{\Lambda}_H - \Lambda_p \|_2 = O_P\left( \frac{1}{H^{\vartheta}} + \frac{H^2 p}{n} + \sqrt{\frac{H^2 p}{n}} \right).$

The proof of the theorem is deferred to the Appendix. As a direct
consequence of Theorem 1, we observe that if
$\rho = \lim_{n \to \infty} p/n = 0$, we may choose $H = \log(n/p)$ such that
the right hand side of equation (7) converges to 0. Thus, Theorem 1 implies
that $\hat{\Lambda}_H$ is a consistent estimate of $\Lambda_p$ if
$\rho = 0$.

Remark 3 (More on Convergence Rate). Note that the convergence rate in (7)
depends on the choice of $H$. This may not seem very desirable at first
glance. Since the convergence rate of $\hat{\Lambda}_H$ might be different
from that of $col(\hat{V}_H)$, we may expect that the convergence rate of
$col(\hat{V}_H)$ does not depend on the choice of $H$. In fact, we have

(8)  $\hat{\Lambda}_H - \Lambda_p = \left( \hat{\Lambda}_H - P_V \hat{\Lambda}_H P_V \right) + \left( P_V \hat{\Lambda}_H P_V - \Lambda_p \right).$

From the proof of Theorem 1, we can easily check that the first term is of
convergence rate $\frac{H^2 p}{n} + \sqrt{\frac{H^2 p}{n}}$ and the second
term is of rate $\frac{1}{H^{\vartheta}}$. Since $P_V \hat{\Lambda}_H P_V$
and $\Lambda_p$ share the same column space, if we are only interested in
estimating $P_V$, then the convergence rate of the second term does not
matter provided that $H$ is a large enough integer, which may depend on
$\vartheta$ and $\gamma_3$ but does not depend on $n$ and $p$. For such an
$H$, $\mathcal{A}_H(\gamma_1, \gamma_2)$ is non-empty. Theorem 1 and (8) hold
for both categorical and continuous response variable $Y$.


Theorem 2. Under conditions (A1), (A2), (A3), (A4) and assuming that
$\rho = \lim p/n = 0$, we have

$\| \hat{\Sigma}_x^{-1} \hat{\Lambda}_H - \Sigma_x^{-1} \Lambda_p \|_2 \to 0$ as $n \to \infty$

with probability converging to one, where $\hat{\Sigma}_x$ is the sample
covariance matrix of $x$.

We define the distance $D(V_1, V_2)$ of two $d$-dimensional subspaces $V_1$
and $V_2$ as the operator norm (or Frobenius norm) of the difference between
$P_{V_1}$ and $P_{V_2}$. Simple linear algebra shows that if the
$\beta_j$'s satisfy $\Lambda_p \beta_j = \lambda_j \Sigma_x \beta_j$ with
$\lambda_1 \ge \cdots \ge \lambda_d > 0$, then

$col(V) = span\{\beta_1, \ldots, \beta_d\}.$

Let $\hat{V}$ be the matrix formed by the top $d$ generalized eigenvectors of
$(\hat{\Sigma}_x, \hat{\Lambda}_H)$. Recall that the $d$-th eigenvalue of
$\Lambda_p$ is assumed to be bounded away from 0. Therefore Theorem 2 implies
that $D(P_{\hat{V}}, P_V) \to 0$ when $\rho = 0$.
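
The distance between two subspaces defined above can be computed directly from basis matrices. The following is a minimal sketch with hypothetical function names (`projection`, `subspace_distance`); it assumes each basis matrix is a two-dimensional array with full column rank.

```python
import numpy as np

def projection(v):
    """Orthogonal projection matrix onto col(v); v must have full column rank."""
    q, _ = np.linalg.qr(v)          # orthonormal basis of col(v)
    return q @ q.T

def subspace_distance(v1, v2, norm="fro"):
    """Operator ("op") or Frobenius ("fro") norm of P_{V1} - P_{V2}."""
    diff = projection(v1) - projection(v2)
    return np.linalg.norm(diff, 2 if norm == "op" else "fro")
```

For two mutually orthogonal $d$-dimensional subspaces the Frobenius version equals $\sqrt{2d}$ and the operator version equals 1, which gives a reference scale for the largest possible estimation error.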

We have already shown that the SIR procedure provides us with a consistent
estimate of the sufficient dimension reduction space when $\rho = 0$ under
mild conditions. It is then natural to ask: is this condition necessary? Our
next theorem gives the answer.

Theorem 3. Under conditions (A1), (A2), (A4) and assuming that
$x \sim N(0, I_p)$ for the single index model

$y = f(\beta^\top x, \epsilon),$

we have:

(i) When $\rho = \lim p/n \in (0, \infty)$,
$\| \hat{\Lambda}_H - \Lambda_p \|_2$, as a function of $\rho$, is dominated
by $\sqrt{\rho} \vee \rho$ when $H, n \to \infty$;

(ii) Let $\hat{\beta}$ be the principal eigenvector of the SIR estimator
$\hat{\Lambda}_H$. If $\rho = \lim p/n > 0$, then there exists a positive
constant $c(\rho) > 0$ such that

$\liminf_{n \to \infty} \mathbb{E}\angle(\hat{\beta}, \beta) > c(\rho)$

with probability converging to one.

We illustrate this result via a numerical study of the linear model

(9)  $y = x^\top \beta + \epsilon$ where $\beta^\top = (1, 0, \ldots, 0)$, $x \sim N(0, I_p)$, $\epsilon \sim N(0, 1)$.

Figure 1 shows how $\mathbb{E}\angle(\hat{\beta}, \beta)$ is related to the
dimension $p$ for fixed ratio $\rho = p/n$ (taking values in
$\{.1, .3, .7, 1, 2, 4\}$), where $\hat{\beta}$ is estimated by SIR with the
slice number $H = 10$. For each $p$, $\mathbb{E}\angle(\hat{\beta}, \beta)$
is calculated based on 100 iterations. It is seen that this expected angle
converges to a positive number when the ratio $\rho$ is non-zero. In Figure
2, we have plotted $\mathbb{E}\angle(\hat{\beta}, \beta)$ against the ratio
$\rho = p/n$, varying between 0.01 and 4 with an increment of 0.01. The
sample size $n$ is 200 and the slice number $H$ is 10. It is seen that the
expected angle decreases to zero as $\rho$ approaches zero, and increases
monotonically when $\rho$ increases.
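
A reduced version of this experiment can be sketched as follows. It is an illustration only, under assumptions that differ from the study above: fewer replications (20 rather than 100), a fixed random seed, and hypothetical function names. Because $\Sigma_x = I_p$ in model (9), the principal eigenvector of $\hat{\Lambda}_H$ is used directly as $\hat{\beta}$, as in Theorem 3.

```python
import numpy as np

def sir_principal_direction(x, y, H):
    """Principal eigenvector of the slice-based estimate from equation (2)."""
    n, p = x.shape
    c = n // H
    order = np.argsort(y)
    means = x[order][: c * H].reshape(H, c, p).mean(axis=1)  # slice means
    lam = means.T @ means / H
    _, vecs = np.linalg.eigh(lam)         # ascending eigenvalues
    return vecs[:, -1]                    # unit-norm top eigenvector

def mean_angle(n, p, H=10, reps=20, seed=0):
    """Monte Carlo average of the angle (radians) between beta_hat and beta."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[0] = 1.0
    angles = []
    for _ in range(reps):
        x = rng.standard_normal((n, p))
        y = x @ beta + rng.standard_normal(n)       # model (9)
        b = sir_principal_direction(x, y, H)
        cos = min(1.0, abs(b @ beta))               # sign of b is arbitrary
        angles.append(np.arccos(cos))
    return float(np.mean(angles))
```

Calling `mean_angle(400, 20)` and `mean_angle(100, 200)` compares a small ratio $p/n = 0.05$ with a large ratio $p/n = 2$; the second angle should be visibly larger.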

Results in this section have shown that there is a phase transition
phenomenon of the SIR procedure. That is, the estimate of the dimension
reduction space is consistent if and only if the ratio
$\rho = \lim p/n = 0$. This provides a theoretical justification for imposing
additional structural assumptions such as sparsity in high dimension.

4. SIR in ultra-high dimension. As we have shown in Section 3, the SIR
estimator fails to be consistent if $\rho = \lim p/n \neq 0$. Hence, when
$p \gg n$, some structural assumptions are necessary for getting a
consistent estimate of the central space. In this paper, we assume that both
the loadings of all the directions $\beta_i$'s and the covariance matrix
$\Sigma_x$ are sparse. Other structural assumptions will be studied in our
future work. For the $\beta_i$'s, we impose the following prevalent sparsity
condition.

Fig 1: Numerical approximations of $\mathbb{E}\angle(\hat{\beta}, \beta)$
for model (9) as a function of dimension $p$ for $\rho = .1, .3, .7, 1, 2,$
and $4$, respectively (upper left, upper right, middle left, middle right,
lower left, lower right), where $\hat{\beta}$ is estimated by SIR.

Fig 2: The relationship of $\mathbb{E}\angle(\hat{\beta}, \beta)$ and the
ratio $p/n$ where $\hat{\beta}$ is estimated by SIR.

• (A5) $s = |\mathcal{S}| \ll p$ where
$\mathcal{S} = \{ i \mid \beta_j(i) \neq 0 \text{ for some } j, 1 \le j \le d \}$
and $|\mathcal{S}|$ is the number of elements in the set $\mathcal{S}$.

For $\Sigma_x$, the following class of covariance matrices has been
introduced in Bickel and Levina [2008] (see also Cai et al. [2010]).

$\mathcal{U}(\epsilon_0, \alpha, C) = \left\{ \Sigma_x : \max_j \sum_i \{ |\sigma_{ij}| : |i - j| > l \} \le C l^{-\alpha} \text{ for all } l > 0, \text{ and } 0 < \epsilon_0 \le \lambda_{\min}(\Sigma_x) \le \lambda_{\max}(\Sigma_x) \le \frac{1}{\epsilon_0} \right\}.$

In this paper, to simplify the notations and arguments, we choose a slightly
stronger condition.

• (A6) $\Sigma_x \in \mathcal{U}(\epsilon_0, \alpha, C)$ and
$\max_{1 \le i \le p} r_i$ is bounded where $r_i$ is the number of non-zero
elements in the $i$-th row of $\Sigma_x$.

Let $\mathcal{T} = \{ k \mid \mathrm{var}(\mathbb{E}[x(k)|y]) \neq 0 \}$. If
$k \in \mathcal{T}$, there exists $\eta \in col(\Lambda)$ such that
$\eta(k) \neq 0$. Since we have (3):

$\Sigma_x col(V) = col(\Lambda),$

there exists a $\beta \in col(V)$ such that $\eta = \Sigma_x \beta$. Thus if
$k \in \mathcal{T}$, then $k \in supp(\Sigma_x \beta)$ for some
$\beta \in col(V)$. In particular, with the above sparsity assumptions (A5)
and (A6), we have
$|\mathcal{T}| \le s \max_{1 \le i \le p} r_i = O(s)$.^2 Note that our goal
here is to recover the column space $col(V)$ rather than $\mathcal{S}$.
Indeed, we are not able to consistently recover $\mathcal{S}$ except in the
trivial case. The key for recovering $col(V)$ is to consistently recover the
set $\mathcal{T}$.

At the population level, $\mathrm{var}(\mathbb{E}[x(k)|y])$ can separate
$\mathcal{T}$ from $\mathcal{T}^c$. When there are only finite samples, we
use

(10)  $\mathrm{var}_H(x(k)) = \frac{1}{H} \sum_{h=1}^{H} \bar{x}_{h,\cdot}(k)^2$

as an estimate of $\mathrm{var}(\mathbb{E}[x(k)|y])$. These are the diagonal
elements of the matrix $\hat{\Lambda}_H$. Note that these quantities depend
on the sliced sample means, which are neither independent nor identically
distributed. Thus, the usual concentration inequalities are no longer
applicable. We need extra effort to get the concentration inequalities; this
concentration result is one of the main technical contributions of this
article, and can be further generalized.

^2 We could introduce $r_{\max} = \max_{1 \le i \le p} r_i$, then
$|\mathcal{T}| \le s\,r_{\max}$. The arguments below still work, except we
might need $s\,r_{\max} = o(p)$.

Remark 4. The link function $f(\cdot)$ is not involved explicitly in the
definition of $\mathrm{var}_H(x(k))$, and only the order statistics of the
response are required. This nonparametric characteristic of the method is of
particular interest to us and will be further investigated in future
research. Screening statistics inspired by the sliced inverse regression
idea have been proposed in various formats, such as those in Jiang and Liu
[2014], Zhu et al. [2012] and Cui et al. [2015].

With the quantities $\mathrm{var}_H(x(k))$, we define the inclusion set
$\mathcal{I}_p(t)$ and the exclusion set $\mathcal{E}_p(t)$ below, which
depend on a thresholding value $t$:

$\mathcal{I}_p(t) = \{ k \mid \mathrm{var}_H(x(k)) > t \}$ and $\mathcal{E}_p(t) = \{ k \mid \mathrm{var}_H(x(k)) \le t \}.$

Note that $\mathcal{I}_p(t)$ can be viewed as an estimate of $\mathcal{T}$
and is thus also denoted by $\hat{\mathcal{T}}$. After reducing the dimension
to a level such that $p/n$ is sufficiently small, the SIR estimator
$\hat{\Lambda}_H^{\hat{\mathcal{T}},\hat{\mathcal{T}}}$ is a consistent
estimate of $\Lambda_p^{\mathcal{T},\mathcal{T}}$. Let
$\hat{V}_H^{\hat{\mathcal{T}}}$ be the matrix formed by the top $d$
eigenvectors of $\hat{\Lambda}_H^{\hat{\mathcal{T}},\hat{\mathcal{T}}}$. We
then use $col(\hat{\Sigma}_x^{-1} e(\hat{V}_H^{\hat{\mathcal{T}}}))$ to
estimate the central space $col(V)$, where $\hat{\Sigma}_x$ is a consistent
estimate of $\Sigma_x$. Estimating the covariance matrix and precision matrix
in high dimension is a challenging problem by itself and is not a main focus
of this article. We employ the methods of Bickel and Levina [2008] to solve
it. In summary, we propose the following Diagonal Thresholding screening SIR
(DT-SIR) algorithm:


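Algorithm 1 DT-SIR

1. Calculate $\mathrm{var}_H(x(k))$ according to (10) for $k = 1, 2, \ldots, p$;

2. Let $\hat{\mathcal{T}} = \{ k \mid \mathrm{var}_H(x(k)) > t \}$ for an appropriate $t$;

3. Let $\hat{\Lambda}_H^{\hat{\mathcal{T}},\hat{\mathcal{T}}}$ be the SIR estimator of the conditional covariance matrix for the data $(y, x^{\hat{\mathcal{T}}})$ according to equation (2);

4. Let $\hat{V}_H^{\hat{\mathcal{T}}}$ be the matrix formed by the top $d$ eigenvectors of $\hat{\Lambda}_H^{\hat{\mathcal{T}},\hat{\mathcal{T}}}$;

5. $\hat{\Sigma}_x^{-1} col(e(\hat{V}_H^{\hat{\mathcal{T}}}))$ is the estimate of $col(V)$.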
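
The five steps can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: the names `slice_means` and `dt_sir` are hypothetical, the threshold `t` is taken as given, and the precision matrix is passed in through a hypothetical `sigma_inv` argument that defaults to the identity, whereas the paper estimates it with the methods of Bickel and Levina [2008].

```python
import numpy as np

def slice_means(x, y, H):
    """Means of x within H equal-count slices ordered by y; shape (H, p)."""
    n, p = x.shape
    c = n // H
    order = np.argsort(y)
    return x[order][: c * H].reshape(H, c, p).mean(axis=1)

def dt_sir(x, y, H, d, t, sigma_inv=None):
    """Sketch of Algorithm 1; returns (selected indices, (p, d) basis estimate)."""
    x = x - x.mean(axis=0)
    p = x.shape[1]
    means = slice_means(x, y, H)
    var_h = (means ** 2).mean(axis=0)            # step 1: equation (10)
    selected = np.flatnonzero(var_h > t)         # step 2: inclusion set
    if selected.size < d:
        raise ValueError("fewer than d predictors pass the threshold")
    sub = means[:, selected]
    lam_sel = sub.T @ sub / H                    # step 3: equation (2) on x^T
    _, vecs = np.linalg.eigh(lam_sel)
    v_sel = vecs[:, ::-1][:, :d]                 # step 4: top d eigenvectors
    v_full = np.zeros((p, d))
    v_full[selected] = v_sel                     # embedding e(.)
    if sigma_inv is None:
        sigma_inv = np.eye(p)                    # assumption: Sigma_x = I
    return selected, sigma_inv @ v_full          # step 5
```

The screening step only needs the slice means, so the same array is reused to form the restricted matrix in step 3 instead of slicing the data twice.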


A practical way to choose an appropriate $t$ in step 2 will be presented in
Section 5. To ensure theoretical properties, we need an assumption on the
signal strength:

• (S1) $\exists\, C > 0$ and $\omega > 0$ such that
$\mathrm{var}(\mathbb{E}[x(k)|y]) > C s^{-\omega}$ when
$\mathbb{E}[x(k)|y]$ is not a constant.
Theorem 4. Under conditions (A1) - (A6) and (S1), and letting
$t = a s^{-\omega}$ for some constant $a > 0$ such that
$t < \frac{1}{2} \mathrm{var}(m(y, k))$, $\forall k \in \mathcal{T}$, we have

i) $\mathcal{T}^c \subset \mathcal{E}_p$ holds with probability at least

(11)  $1 - C_1 \exp\left( -C_2 \frac{n}{H^2 s^{\omega}} + C_3 \log(H) + \log(p - s) \right);$

ii) $\mathcal{T} \subset \mathcal{I}_p$ holds with probability at least

(12)  $1 - C_4 \exp\left( -C_5 \frac{n}{H^2 s^{\omega}} + C_6 \log(H) + \log(s) \right),$

for some positive constants $C_1, \ldots, C_6$.

This theorem has a simple implication. If
$\frac{n}{s^{\omega}} \succ \log(p) + \log(s)$, we may choose
$H = \log\left( \frac{n}{s^{\omega} \log(p)} \right)$, such that

$\frac{n}{H^2 s^{\omega}} \succ \log(p) + \log(H) + \log(s).$

Thus, we know $\mathcal{T} = \mathcal{I}_p$ with probability converging to
one. Next, we have results for the consistency of DT-SIR.

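Theorem 5. Under the same assumptions and choosing the same $t$ as in
Theorem 4, if $\frac{n}{s^{\omega}} \succ \log(p) + \log(s)$, we have

$\| e(\hat{\Lambda}_H^{\hat{\mathcal{T}},\hat{\mathcal{T}}}) - \Lambda_p \|_2 \to 0$ as $n \to \infty$

with probability converging to one, where
$\hat{\mathcal{T}} = \mathcal{I}_p(t)$ and
$H = \log\left( \frac{n}{s^{\omega} \log(p)} \right)$.

Theorem 6. Let $\hat{\Sigma}_x$ be the estimator of the covariance matrix
from Bickel and Levina [2008]. Under the same assumptions as Theorem 5, we
have

$\| \hat{\Sigma}_x^{-1} e(\hat{\Lambda}_H^{\hat{\mathcal{T}},\hat{\mathcal{T}}}) - \Sigma_x^{-1} \Lambda_p \|_2 \to 0$ as $n \to \infty$

with probability converging to one.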

5. Simulation Studies. We consider the following settings in generating the
design matrix $x$ and the response $y$. In Settings I-III, each row of $x$
is independently sampled from $N(0, I)$.

• Setting I. $y_i = \sin(x_{i1} + x_{i2}) + \exp(x_{i3} + x_{i4}) + 0.5\,\epsilon_i$, where $\epsilon_i \sim N(0, 1)$;

• Setting II. $y_i = \sum_{j=1}^{7} x_{ij} \cdot \exp(x_{i8} + x_{i9}) + \epsilon_i$, where $\epsilon_i \sim N(0, 1)$;

• Setting III. yi = Xij * exp(X]^£ii Xij) + e* where e* ~ A^(0, 1); 

In Settings IV to VI, each row of $x$ is independently sampled from
$N(0, \Sigma)$.

• Setting IV. $y_i = (x_{i1} + x_{i2} + x_{i3})^3 / 2 + 0.5\,\epsilon_i$, where $\epsilon_i \sim N(0, 1)$ and $\Sigma = (\sigma_{ij})$ is tri-diagonal with $\sigma_{ii} = 1$, $\sigma_{i,i+1} = \sigma_{i+1,i} = \rho$ and $\sigma_{i,i+2} = \sigma_{i+2,i} = \rho^2$;

• Setting V. $y_i = \sum_{j=1}^{7} x_{ij} \cdot \exp(x_{i8} + x_{i9}) + \epsilon_i$, where $\epsilon_i \sim N(0, 1)$, and $\Sigma = B \otimes I_{p/10}$ with $B = (b_{ij})_{1 \le i \le 10, 1 \le j \le 10}$ given as $b_{ij} = \rho^{|i-j|}$;

• Setting VI. Assume the same setting as in Setting V except that $\Sigma = (\sigma_{ij})$ is tri-diagonal with $\sigma_{ii} = 1$, $\sigma_{i,i+1} = \sigma_{i+1,i} = \rho$ and $\sigma_{i,i+2} = \sigma_{i+2,i} = \rho^2$.

• Setting VII. Assume the same setting as in Setting V except that $\Sigma = (\sigma_{ij})$ is given as $\sigma_{ij} =$
DT-SIR first screens all the predictors according to the statistic
$\mathrm{var}_{H,c}(x(k))$, which requires a tuning parameter $t$. We chose
$t$ by using an auxiliary variable method based on an idea first proposed by
Luo et al. [2006] and extended by Wu et al. [2007] and Zhu et al. [2011]. In
our setting, for a given sample $(y_i, x_i)$, we generate
$z_i \sim N(0, I_{p'})$ where $p'$ is sufficiently large and chosen as $p$
in our simulation studies. It is known that $y$ and $z$ are independent. The
threshold $t$ can be chosen as

$\hat{t} = \max_{1 \le k \le p'} \{ \mathrm{var}_{H,c}(z(k)) \}.$

In DT-SIR, when n > 1000, H is chosen as 20; when n < 1000, H is chosen
as 10 in the screening step and 20 in the SIR step.
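
The auxiliary-variable choice of the threshold can be sketched as below. This is a minimal illustration with hypothetical names (`var_h`, `auxiliary_threshold`); the auxiliary predictors are drawn from a NumPy random generator supplied by the caller, and the statistic is the slice-based quantity of equation (10).

```python
import numpy as np

def var_h(x, y, H):
    """Screening statistic of equation (10) for every column of x."""
    n, p = x.shape
    c = n // H
    order = np.argsort(y)
    means = x[order][: c * H].reshape(H, c, p).mean(axis=1)
    return (means ** 2).mean(axis=0)

def auxiliary_threshold(y, H, p_aux, rng):
    """Largest statistic among p_aux pure-noise predictors independent of y."""
    z = rng.standard_normal((len(y), p_aux))   # z is independent of y by construction
    return var_h(z, y, H).max()
```

A predictor is then kept when its own statistic exceeds the returned value, so the threshold adapts to the sample size and the number of slices without further tuning.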

We also consider the following alternative methods in the screening step:
Sure Independent Ranking and Screening (SIRS) in Zhu et al. [2011], SIR for
variable selection via Inverse modeling (SIRI) in Jiang and Liu [2014], and
trace pursuit (TP) in Yu et al. [2016]. As a comparison, we also considered
two screening methods that are not based on sliced regression: distance
correlation (DC) in Székely et al. [2007] and sure independence screening
(SURE) in Fan and Lv [2008]. For SIRS, the threshold is chosen according to
the auxiliary statistic (2.9) of Zhu et al. [2011]. For SIRI, the predictors
are chosen according to 10-fold cross validation. For trace pursuit, the two
threshold values are chosen as the 10-th and 5-th quantile of a weighted
$\chi^2$ distribution given in Theorem 3.1 of Yu et al. [2016]. In both SURE
and DC screening, the top $\lfloor \gamma n \rfloor$ predictors, where
$\gamma = 0.01$, are kept for subsequent analyses.

After the screening step, similar to DT-SIR, we then applied the SIR
algorithm (steps 3-5 of DT-SIR) to estimate $col(V)$. These alternative
methods are denoted as SIRS-SIR, SIRI-SIR, SURE-SIR, DC-SIR, and TP-SIR,
respectively, in the following discussions. Another method that we compared
with is the sparse SIR, abbreviated as SpSIR, proposed in Li [2007]. After
obtaining an estimator $col(\hat{V})$, we calculate $D(P_{\hat{V}}, P_V)$ as
a measure of the estimation error. We replicate this step 100 times,
calculate the average distance for the estimation result from each method,
and report these numbers in Tables 1-3. For each setting, the average
distance of the optimal method is highlighted using bold fonts. We further
run a two-sample T-test to test if the actual estimation error of each
method is significantly different from that of the best method for that
example at the 1% level of significance.

Under all settings, the average distance obtained by DT-SIR was much smaller
than that obtained by SpSIR and SURE-SIR. The p-values for comparing DT-SIR
and SpSIR/SURE-SIR are all significant at the 0.01 level. When $p > n$, the
sparse SIR completely failed because the average distance of the estimated
space to the true space is $\sqrt{2d}$, indicating that the space estimated
by sparse SIR is orthogonal to the true space spanned by $\beta$.

Under Settings II-IV, DT-SIR performed either the best or not significantly
worse than the best method. For all other cases, DT-SIR performed the best
except for a few cases: Setting I when n = 500, p = 1000, Setting V when
n = 500, p = 6000, Setting VI when n = 500, p = 6000, and Setting VII when
n = 1000, p = 1000.

When p = 6000, n = 500, both DT-SIR and SIRI-SIR were the winners. Under
Setting III, DT-SIR performed better than SIRI-SIR; under Settings V and VI,
SIRI-SIR performed better than DT-SIR; under other settings, these two
methods were comparable.

To graphically show the performance of various methods, we consider Setting
IV with d = 1. Consider two cases when (n,p) = (2000,1000) and
(n,p) = (500,100). We calculated the estimated directions $\hat{\beta}$
using various methods and computed the angle between
$\langle \hat{\beta} \rangle$ and $\langle \beta \rangle$. We replicate this
step 100 times to calculate the average angles for each method. The results
are displayed in Figure 3, which shows clearly that DT-SIR performed better
than its competitors.

Additionally, DT-SIR is computationally efficient. To show this, we report
the computing time for one replication under Setting II for various pairs of
(n,p) in Table 4. All computations were done on a computer with Intel
Xeon(R) E5-1620 CPU@3.70GHz and 16GB memory. It is clearly seen that DT-SIR
performed as fast as SURE-SIR, and both were much faster than other
competitors. Consider the case when p = 3000, n = 2000. The computation time
of DT-SIR was only 30 seconds, while that for DC-SIR was 21 minutes and 38
seconds, and that for TP-SIR was 6 minutes and 17 seconds.

Table 1
The average distance of the space estimated by each of the 7 methods tested
to the true space $col(V)$ under various settings with p = 1000. The
boldfaced number in each row represents the best result for that simulation
scenario, and the (*) in cells represents that the p-value of the two-sample
T-test comparing the estimation error of the corresponding method with that
of the best method is less than 0.01. Cells marked n/r are not recoverable
from the source; boldface is not reproduced here.

Setting  n     DT-SIR    SIRI-SIR  SIRS-SIR  SpSIR     SURE-SIR  DC-SIR    TP-SIR
I        500   0.655(*)  0.751(*)  0.492     2(*)      1.39(*)   n/r       1.18(*)
I        1000  0.3       0.431(*)  0.309     2(*)      1.29(*)   n/r       0.94(*)
I        2000  0.221     0.341(*)  0.226     1.58(*)   n/r       n/r       n/r
I        3000  0.167     0.245(*)  0.149     1.48(*)   0.816(*)  n/r       n/r
II       500   0.383     0.396     0.371     2(*)      1.64(*)   1.08(*)   0.389
II       1000  0.235     0.227     0.256     2(*)      1.36(*)   0.266(*)  0.318(*)
II       2000  0.161     0.157     0.189(*)  1.25(*)   1.25(*)   n/r       0.264(*)
II       3000  0.134     0.129     0.153(*)  0.975(*)  1.12(*)   n/r       0.23(*)
III      500   1.15      1.48(*)   1.38(*)   2(*)      1.97(*)   1.85(*)   1.13
III      1000  0.426     0.974(*)  0.596(*)  2(*)      1.94(*)   1.57(*)   0.429
III      2000  0.263     0.403(*)  0.29(*)   1.33(*)   1.89(*)   0.996(*)  0.338(*)
III      3000  0.214     0.297     0.238(*)  1.06(*)   1.82(*)   0.475(*)  0.299(*)
IV       500   0.263     0.257     0.333     1.41(*)   n/r       0.334(*)  0.332(*)
IV       1000  0.219     0.447(*)  0.25      1.41(*)   0.436(*)  n/r       n/r
IV       2000  0.161     0.4(*)    0.196(*)  0.42(*)   0.442(*)  n/r       n/r
IV       3000  0.134     0.377(*)  0.177(*)  0.297(*)  0.43(*)   n/r       n/r
V        500   n/r       0.529     0.562     2(*)      1.62(*)   1.24(*)   1.09(*)
V        1000  n/r       0.463(*)  0.514(*)  2(*)      1.15(*)   0.367     0.615(*)
V        2000  n/r       0.418(*)  0.341(*)  1.51(*)   0.926(*)  0.569(*)  0.54(*)
V        3000  0.249     0.399(*)  0.284(*)  1.24(*)   0.691(*)  0.597(*)  0.511(*)
VI       500   0.568     0.535     0.566     2(*)      n/r       1.24(*)   1.08(*)
VI       1000  0.427     0.524(*)  0.548(*)  2(*)      n/r       0.39      0.641(*)
VI       2000  0.311     0.469(*)  0.351(*)  1.51(*)   0.927(*)  0.598(*)  0.583(*)
VI       3000  0.265     0.456(*)  0.307(*)  1.25(*)   0.807(*)  0.622(*)  0.56(*)
VII      500   0.556     n/r       n/r       2(*)      1.66(*)   1.26(*)   1.11(*)
VII      1000  0.436(*)  0.528(*)  n/r       2(*)      1.22(*)   0.39      0.643(*)
VII      2000  0.303     0.465(*)  n/r       1.51(*)   0.747(*)  0.589(*)  n/r
VII      3000  0.258     0.468(*)  n/r       1.25(*)   0.698(*)  0.63(*)   n/r

6. Conclusion. When the dimension p diverges to infinity, classical 
statistical procedures often fail unless additional structures such as sparsity 
conditions are imposed. Understanding boundary conditions of a statistical 
procedure provides us theoretical justification and practical guidance for our 
modeling efforts. In this article, we provide a new framework to show that 
p = lim ^ is the phase transition parameter for the SIR procedure. Under 
certain conditions, it is shown that the SIR estimator is consistent if and 
Table 2

The average distance of the space estimated by each of the 7 methods we tested to the
true space col(V) under various settings with n = 2000.



Table 3

The average distance of the space estimated by each of the 7 methods tested to the true
space col(V) under various settings with n = 500 and p = 6000.

Setting  DT-SIR     SIRI-SIR   SIRS-SIR   SpSIR     SURE-SIR   DC-SIR     TP-SIR
I        0.694      0.631      0.606      2(*)      1.43(*)    0.97(*)    1.19(*)
II       0.446      0.462      0.414      2(*)      1.74(*)    1.08(*)    0.4
III      1.35       1.56(*)    1.56(*)    2(*)      1.99(*)    1.88(*)    1.37
IV       0.163      0.122      0.245(*)   1.41(*)   0.27(*)    0.305(*)   0.195(*)
V        0.481(*)   0.431      0.486(*)   2(*)      1.62(*)    1.1(*)     0.995(*)
VI       0.463(*)   0.423      0.494(*)   2(*)      1.62(*)    1.11(*)    0.999(*)
VII      0.44       0.412      0.477(*)   2(*)      1.61(*)    1.1(*)     1.03(*)
Fig 3: Simulated value of $E\angle(\hat\beta, \beta)$ for the various methods (horizontal axis: Directions). Left panel:
(n, p) = (2000, 1000); Right panel: (n, p) = (500, 1000).


Table 4

Comparison of computing time under setting II.

            DT-SIR   SIRI-SIR   SIRS-SIR   SpSIR    SURE-SIR      DC-SIR    TP-SIR
p = 1000
n = 500     1"       1'12"      7"         11"      1"            24"       29"
n = 1000    2"       2'2"       20"        11"      1"            1'52"     1'2"
n = 2000    3"       3'27"      1'14"      13"      2"            7'38"     2'18"
n = 3000    4"       4'59"      2'45"      15"      3"            6'51"     3'7"
n = 2000
p = 500     1"       2'48"      35"        2"       1"            3'46"     1'7"
p = 1000    3"       3'27"      1'14"      13"      2"            7'38"     2'18"
p = 2000    12"      4'55"      2'35"      1'39"    12"           14'24"    3'22"
p = 3000    30"      6'0"       4'10"      5'19"    (illegible)   21'38"    6'17"



only if $\rho = 0$. When $\rho > 0$, where the original SIR fails to be consistent, we
propose a two-stage method, DT-SIR, for variable screening and selection
in ultra-high dimensional situations and show that the method is consistent.
We have used simulated examples to demonstrate the advantages of DT-SIR
compared to its competitors. This method is computationally fast and can
be easily implemented for large data sets.


Appendices 
In the following two sections we offer some details about our theoretical
derivations, but some more tedious intermediate steps (organized as Lemmas
6-21) are deferred to the Supplemental Document to this article, which is
available online.

A. The Key Lemma. The following lemma plays an important role
in developing the high dimensional theory for sliced inverse regression. The
proof of this key lemma is lengthy and technical. It will be helpful to keep
in mind that $H$ and $\nu$ (if they are not constants) grow at a very slow rate
compared with $c$ and $n$ (e.g., polynomial in $\log(n)$). Let $m(y) = E[x \mid y]$ and
$x = m(y) + \epsilon$. The notations $m_{h,j}$, $\bar m_{h,\cdot}$, $\bar m$ and $\epsilon_{h,j}$, $\bar\epsilon_{h,\cdot}$, $\bar\epsilon$ are defined similarly
to $x_{h,j}$, $\bar x_{h,\cdot}$ and $\bar x$, which were introduced before.

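Lemma 1. Let $x \in \mathbb{R}^p$ be a sub-Gaussian random variable which is upper
exponentially bounded by $K$ (see Definition 4). For any unit vector $\beta \in \mathbb{R}^p$,
let $x(\beta) = \langle x, \beta\rangle$ and $m(\beta) = \langle m, \beta\rangle = E[x(\beta) \mid y]$. We have the following:

i) If $\mathrm{var}(m(\beta)) = 0$, there exist positive constants $C_1$, $C_2$ and $C_3$ such
that for any $b = O(1)$ and sufficiently large $H$, we have

$$P\big(\mathrm{var}_H(x(\beta)) > b\big) \le C_1 \exp\Big(-C_2 \frac{cb}{H} + C_3 \log(H)\Big).$$
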
ii) If $\mathrm{var}(m(\beta)) \neq 0$, there exist positive constants $C_1$, $C_2$ and $C_3$ such

that, for any $\nu > 1$, we have


$$\big|\mathrm{var}_H(x(\beta)) - \mathrm{var}(m(\beta))\big| \ge \frac{1}{2\nu}\,\mathrm{var}(m(\beta))$$


with probability at most 



where we choose H such that for some sufficiently large 

constant C 4 . 
A.1. Proof of Lemma 1 i). If $m(\beta) = 0$ (or equivalently $\mathrm{var}(m(\beta)) = 0$),
since

\ ^ c — 1 c l 

<2 ^ ^ + 2 ^-e/i^ciP)^ 

for h = 1, ...,H - 1 and €H,-if3) = ^ J2i=i ^H,ii(3), we have 


varH{x{^)) — var{m{l3)) 

1 1 


H 


H 

c—1 


4 £" (7^|:'m(/3)) 


h 

=21 + 2II. 


Thus

(A.1) $P(\mathrm{var}_H(x(\beta)) > b) \le P(I > b/4) + P(II > b/4)$.

Lemma 17 (iii) in Supplement implies that


^{4P)\y&SH >t) < CHexpl-j^ 


for some positive constant $C$. Since $E[x(\beta) \mid y] = 0$, we have $E[x(\beta) \mid y \in S_h] = 0$.
From Lemma 9, we know that for $1 \le h \le H - 1$, the $\epsilon_{h,i}(\beta)$ can be treated
as $c - 1$ i.i.d. samples from $\epsilon(\beta) \mid y \in S_h$. According to Lemma 17 (iv),


1 


C—1 


C — 


2 = 1 

Similarly, we have 


Y ^ 1 < Cl exp 


( -bjc-l) 

\8C2HK‘^ + AVbK 


P 


- ^ eH,i(/3)| > ^6/2] < Cl exp (- 
^ i=i / 


—be 

8C2HK‘^ + AVbK J ' 
Thus, if $b = O(1)$ and $H$ is sufficiently large, we have


P(/> <Oi ( (i7-l)exp 


cb 




-b{c-l) \ 


—be 


\8C2HK‘^ + AVbKJ ^ \8C^HK^ + A^bK 


<Ci exp ( -O 2 — + O 3 log(i7) 


for some positive constants $C_1$, $C_2$ and $C_3$.

Since the $\epsilon_i(\beta)$ are i.i.d. samples from a sub-Gaussian distribution $\epsilon(\beta)$ with
mean 0 and upper-exponentially bounded by $2K$, Lemma 19 implies that if
$b = O(1)$ and $H$ is sufficiently large, we have


P(// > 6/4) <P(- Vei(/3)2 > 6c/4) 
n 

i 

<P(- e,(/3)2 - E[e{pf] > be/A - E[eif3f) 

I 

- J]e,(/3)2-E[e(/3)2] yeb/A-AxA 


<P 


^ Vn{eb/A-AK‘^)\ 
<Oi exp ( -O 2 - ^2 - j 

ch 

<Oiexp{ - 02 - + 03 log(//) 


for some positive constants $C_1$, $C_2$ and $C_3$ if $H$ is sufficiently large. We used
above the fact that $E[\epsilon(\beta)^2] \le 4K^2$.

To summarize, if $b = O(1)$ and $H$ is sufficiently large, we have

$$P\big(\mathrm{var}_H(x(\beta)) > b\big) \le C_1 \exp\Big(-C_2 \frac{cb}{H} + C_3 \log(H)\Big)$$

for some positive absolute constants $C_1$, $C_2$ and $C_3$.


A.2. Proof of Lemma 1 ii). Since $x$ is sub-Gaussian and $\beta$ is a unit vector,
we know that $\mathrm{var}(m(\beta)) = O(1)$. If $m(\beta) \neq 0$ (or equivalently $\mathrm{var}(m(\beta)) \neq 0$),
we have


varH{x{(3)) — var{m{P)) 

- var{m{(3)) 


+ ^^mh,{l3)eh,{l3) + eh, {(3) 


h h 

— var{m{(3)) 

+ ^2 + ^3 + ^ 4 ) 


where 


Ai = 


- var{m{f3)) 




(A.2) 




h h 

Lemma 1 ii) is a direct corollary of the following properties of the $A_i$'s.


Lemma 2. Let the Ai’s be defined as in equation (A.2). There exist 
positive constants Ci, C 2 and C 3 , such that for any v > 1 and H satisfying 
= Niu for sufficiently large Ni, we have that each of the following events 

i) 01 = I Ai < ^var{m{l3)) 

ii) 02 = I A 2 < ^var{m{(3)) |, 

iii) 03 = I Ag < -^var{m{l3)) 

iv) 04 = I A 4 < ■^var{m{l3)) 

occurs with probability at least 

(A.3) 1 - C. exp + c, log(if)) . 


□ 
A.2.1. Proof of Lemma 2.

A.2.1.1. Proof of i): Recall the definitions of the random intervals $S_h$, $h = 1, 2, \dots, H$,
and the random variable $\delta_h = \delta_h(\omega) = \int_{y \in S_h(\omega)} f(y)\,dy$. We have


< 


jj - var{m{p)) 

h 

var{m{(3)) -Y^h{Lh{(3)f + X] “ X] 


=Bi + B2 


^ ~ Hno+i ■'^here no = N 21 ' for some sufficiently large constant N 2 
and let event E{e) be defined as in Lemma 11 in Section E, i.e., E{e) = 

I w |(5/i — -^1 > e, Vh |. For any w G E{eY, we have 


(A.4) 

(A.5) 

(A.6) 


Bi =Y^h{‘^)var{m{P)\y G Sh{u})) 
h 

X] var{m{p)\y G Sh{uj)) 


+ He)^var{m{(3)) 
<^^var{m{f3)), 


where inequality (A.4) follows from the fact that 5h{oj) < + e, inequality 

(A.5) follows from the sliced stable condition (4) and inequality (A.6) follows 
from the requirement that > N\v, and the fact 


(A.7) 


B2 <6 (/3>,)2 = -4 (/3>,)' 

h h ^ 

He 


<- 


1 -He 




h 


<- 


' N 21 ' 




where inequality (A.7) follows from the fact Sh > — e. 

From (A.5), we observe that 


(A.8) 




273 

Niu 


var{m{P)). 
Combining with (A.7), we then have


So when E{eY occurs, we have 

’273 


Bi + B2 < 


+ 


A^ii/ ^"22^ 


1 + 


273 

Niu 


var{m{P)). 


Note that Ni and N 2 can be chosen sufficiently large so that 

d'y-j 1 

(A.9) Bi + B 2 < —^var{m{P)) < —var{m{f3)). 

Nijy Aiy 


Consequently, conditioning on E{ey where e = choose > 

Niv, then 


(A.IO) ^ Ml3)f - var{m{(3)) < ^mr(m(/3)) 


Since var{m{P)) = 0(1), and e = desired probabil¬ 

ity bound follows from Lemma 11, i.e., 

P(E(£)) < 0 , exp + log(if^V^?TT)) 

<C.exp(-C /''°y» +C,log(g)). 


for some positive constants 01,02 and O 3 . 


□ 


Remark 5. From (A.IO), conditioning on E{eY, we obtain the following 
two inequalities 


(A.ll) + 

h ^ ^ 

and 

(A- 12 ) 



In particular, jj YjK and jj \^J-hiP)\ are bounded by Op(l). 
A.2.1.2. Proof of ii) : Denote by m'^(/3) and

by we have 

H H 


h=l 


+ 


2 (c-l) / 1 


H 


Hc^ 

1/2 


h=l 


H 




1/2 

2 


h=l 


Hc^ 




h=l 


h=l 


= I + II + III + IV 


Before we start proving this part, we need to introduce two events and bound 
their probabilities. First, let 

(A.13) F;i(1V3,z^) = I r/(/3) > 

where r/(/3) = maxi<h<H | ~ | • According to Lemma 17 (i), 

(iv) and Bonferroni’s inequality, we have 


(A.14) P(.Fi(iV 3 ,i^)) <2Fexp 

(A.15) <(71 exp ( —C 2 


— (c — l)var{m{f3)) 


{N^uY 2CHK^ + .j^^var{m{(3))K^ 

c var{m{f3)) 




+ C 3 log(i7) 


for some positive constants Ci, C 2 and C 3 . Second, let 


E2{N^,u) = { 




1 


-varim 


m}, 


then 




2 ^ var{m{(3)) 


P(^(/V4,i^))<P 

\nc N 4 U 

( n m var{m{f3)) 
<Ci exp -C 2 Vn[c - 

V 


-K^ 


<Ciexp|-c /’'<’;<^» +C3log(g) 


for some positive constant Ci, C 2 and C 3 . It is easily to see E{N 4 ,iy) C 
EiN4,i2^). 
For I. Conditioning on the event $E(\epsilon)^c \cap E_1(N_3, \nu)^c$ and combining with (A.12),
we have


h h 


< 


1 

A31/ 


2 


+ 


2 

A31/ 




var{m{P)) 


<^var{m{l3)) 


if A 3 is sufficiently large . 


Remark 6. From above, conditioning on the event E{eY H £^1(^3, vY, 
we have 

(A.16) <i^^^uar(m(/3)). 

h=i ^ 


For 11. Conditioning on £ 2 (^ 4 , vY^ 'w® have II < 

For III. When the event £(e)'^n£i(A 3 , z/)'^n£ 2 (A 4 , i^'^Y occurs, according 
to equation (A.16), 

HI < 

if A 4 is sufficiently large. 

For VI. When the event E{eY C Ei{N^,vY E2 {Na,vY occurs, from 
(A. 10 ), we know 

VI = < ^var{m{(3)) < -:^var{m{P)). 

h 

To summarize, we know that there exist positive constant Ci, £21 £3 and 
£4 such that 

A 2 <I + II + III + VI < ^uar(m(/3)) 

OU 

holds on the event £(e)'^n£i(A 3 , z/)'^n£ 2 (A 4 , i/‘^Y which is with probability 
at least 

. ^ f ^ c varimiS)) ^ 

1 - £1 exp f -£2 -- - + £3 log(£) j 

for some positive constants £i ,£2 and £ 3 . 
A.2.1.3. Proof of iii): Similar to the proof of Lemma 1 (i), we have

P(^3 >b)< Cl //exp f-I 

\8C2HK^ + 4VbK2j 

for some positive constants Ci, C 2 and C 3 . In particular , if we take b 
^var{m{l3)), we know that 

^3 < r^varimiP)) 

Ibiy 

with probability at least 

l-C,exp(-c /°°y» +C 3 log(fl)) 

for some positive constant Ci, C 2 and C 3 . 

A.2.1.4- Proof of iv) : Let 

Di = D 2 = As = 

h h 

Consequently, 

P > ■:^var{m{P))^ 

(A,17) 

Note that 

\Di - var{m{f3))\ < .42 + x4i 

According to (i) and (ii), the right hand side of (A. 17) is bounded by 

^ f „ c varimiB) ^ 

Cl exp (-C 2 -- - + C 3 log(i/) j 

for some positive constants Ci,C 2 and C 3 . 


B. Proofs of theorems in section 3. 









B.1. Proof of Theorem 1. Let $S$ be the central subspace of dimension
$d \ll p$, i.e., $y \perp\!\!\!\perp x \mid P_S x$ and $\dim(S) = d$. We have the decomposition

(B.1) $x = P_S x + P_{S^\perp} x = z + w = E[z \mid y] + (z - E[z \mid y]) + w = m + v + w$,

where $z = P_S x$, $m = E[z \mid y]$, $v = z - E[z \mid y]$ and $w = P_{S^\perp} x$. Note that $m$
lies in the central curve, $v$ lies in the central space and $w$ lies in the space
perpendicular to $S$. We introduce


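(B.2) $m_{h,j}$, $\bar m_{h,\cdot}$, $\bar m$, $z_{h,j}$, $\bar z_{h,\cdot}$, $\bar z$, and $w_{h,j}$, $\bar w_{h,\cdot}$, $\bar w$

similar to the definitions of $x_{h,j}$, $\bar x_{h,\cdot}$ and $\bar x$. Consequently, we can define
$\Lambda_z = \mathcal{Z}\mathcal{Z}^\tau$ and have the following decomposition

(B.3) $\Lambda_H = \frac{1}{H}\sum_h \bar x_{h,\cdot}\bar x_{h,\cdot}^\tau = \Lambda_z + \mathcal{Z}\mathcal{W}^\tau + \mathcal{W}\mathcal{Z}^\tau + \mathcal{W}\mathcal{W}^\tau$,

where

$\mathcal{Z} = \frac{1}{\sqrt{H}}(\bar z_{1,\cdot}, \dots, \bar z_{H,\cdot})$ and $\mathcal{W} = \frac{1}{\sqrt{H}}(\bar w_{1,\cdot}, \dots, \bar w_{H,\cdot})$.


We need to bound $\|\Lambda_z - \Lambda_p\|_2$ and $\|\mathcal{W}\mathcal{W}^\tau\|_2$.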
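
As a small numerical illustration of the decomposition (B.1), the sketch below splits a vector into its component in a subspace and the orthogonal remainder. The basis `B`, the dimensions and the random seed are hypothetical choices made only for this example; it is not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(4)
p, d = 6, 2                          # hypothetical ambient and subspace dimensions

B = rng.standard_normal((p, d))      # hypothetical basis of the subspace S
Q, _ = np.linalg.qr(B)               # orthonormal basis of S, shape (p, d)
P_S = Q @ Q.T                        # orthogonal projector onto S

x = rng.standard_normal(p)
z = P_S @ x                          # component of x inside S
w = x - z                            # component of x in the orthogonal complement

residual = np.linalg.norm(z + w - x) # reconstruction error of x = z + w
inner = float(z @ w)                 # inner product of the two components
```

Up to floating-point error, `residual` and `inner` should both vanish, which is the property that lets the cross terms in the proof be handled separately from the two main terms.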


Lemma 3. 

(B.4) llwvWlb < Op(^) 

n 

Proof. Prom Lemma 1, for any unit vector (3 X col{A), i.e. var{m{f3)) = 
0, we have 

(B.5) P(/3^>V>V^/3 > C^) < Cl exp {-C 2 P + log{H)). 

n 

for some positive constants Ci and 6*2. Then the e-net argument (see e.g., 
Vershynin [2010]) implies that ||VV’VV’'^|| < Op{^^) □ 

Lemma 4. 

(B-6) jjAx — Apll < Op • 

As a direct corollary, we have jjA^II < Op{l). 
Proof. From Lemma 1, we have


C 


P ( 1/3"(A, - A)/3| > ^||A ||2 ) < Cl exp ( -C 2 


c var{m{P)) 


+ C3log(iL) 


Note that we only need to verify it for /3 G col{Ap), which is a d-dimensional 
space. Then the e-net argument implies that ||Az — Ap ||2 < Op (ik)- □ 

Theorem 1 follows from Lemma 4 and Lemma 3. In fact, 

||Ah - Apll < ||A^ - Apll + II^W" + >VZ"||2 + ||>V>V"||2 


sop T + 




n 


n 


□ 


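B.2. Proof of Theorem 2. Theorem 2 is a direct corollary of Theorem 1
and Lemma 13. In fact, we have:

$$\|\hat\Sigma_x^{-1}\Lambda_H - \Sigma_x^{-1}\Lambda_p\|_2 \le \|\hat\Sigma_x^{-1} - \Sigma_x^{-1}\|_2\,\|\Lambda_H\|_2 + \|\Sigma_x^{-1}\|_2\,\|\Lambda_H - \Lambda_p\|_2,$$

which $\to 0$ if $\rho = \lim_{n \to \infty} p/n = 0$. □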


B.3. Proof of Theorem 3. 

(i) The proof for part (i) is similar to the proof of Theorem 1, and the
standard Gaussian assumption on $x$ simplifies the argument and improves
the results. Since $w = P_{S^\perp} x$ is normal and independent of $y$, there exists
a normal random variable $\epsilon \sim N(0, I)$ such that $w = \Sigma^{1/2}\epsilon$ where $\Sigma = \mathrm{cov}(w)$.
Using the decomposition (B.3), we may write

(B.7) $\mathcal{W} = \frac{1}{\sqrt{Hc}}\,\Sigma^{1/2} E_{p \times H}$,

where $E_{p \times H}$ is a $p \times H$ matrix with i.i.d. standard normal entries. Corollary
4 implies that
k-z 2 ^


< IIA, 


■ p \\2 + 0p { 


By the Cauchy inequality, we have 


||Ci||i< ||A,||2||>V>Vni2<Op(0. 

Thus, 

\\AH-Ap\\2<Opi^ + ^ + ^). 

In particular, if $H, n \to \infty$ and $\rho = \lim p/n \in (0, \infty)$, we know that $\|\Lambda_H - \Lambda_p\|_2$
is dominated by $\rho \vee \sqrt{\rho}$ as a function of $\rho$. □


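(ii) The proof for part (ii) is similar to the proof of Theorem 2 in
Johnstone and Lu [2009] but is technically more challenging. Let $D = \mathcal{Z}\mathcal{Z}^\tau + \mathcal{W}\mathcal{W}^\tau$
and $B = \mathcal{Z}\mathcal{W}^\tau + \mathcal{W}\mathcal{Z}^\tau$, then

$$\Lambda_H = D + B.$$

Since we are working on the single index model with $x$ standard normal, $z = P_\beta x = \beta z(y)$
for some scalar function $z(y)$ and $w = P_{\beta^\perp} x$ are independent
normal random variables. Let $\Sigma = \mathrm{var}(w)$; then we can write

$$\mathcal{W} = \frac{1}{\sqrt{Hc}}\,\Sigma^{1/2} E,$$

where $E$ is a $p \times H$ matrix with i.i.d. standard normal entries.

Since $z = \beta z(y)$, we have $\mathcal{Z} = \frac{1}{\sqrt{H}}\beta(\bar z_{1,\cdot}, \bar z_{2,\cdot}, \dots, \bar z_{H,\cdot})$. To ease notation,
let $\theta^\tau = (\bar z_{1,\cdot}, \bar z_{2,\cdot}, \dots, \bar z_{H,\cdot})$, then

(B.8) $D = \frac{1}{H}\|\theta\|^2 \beta\beta^\tau + \frac{1}{n}\Sigma^{1/2} E E^\tau \Sigma^{1/2}$, $\quad B = \beta u^\tau + u\beta^\tau$ where $u = \frac{1}{H\sqrt{c}}\Sigma^{1/2} E\theta$.
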
Let 0 < a < arctan{^) and 


(B.9) $N_\alpha = \{\, x \in \mathbb{R}^p : \angle(x, \beta) \le \alpha \text{ and } \|x\| = 1 \,\}$

be the set of unit vectors making an angle of at most $\alpha$ with $\beta$, where $\angle(x, y)$ is the angle
between the vectors $x$ and $y$. In order to proceed, we need the following
lemma.
Lemma 5. Let $\hat\beta$ and $\hat\beta_-$ be the principal eigenvectors of $S_+ = D + B$
and $S_- = D - B$, respectively. There exists a positive constant $\omega(\alpha)$ such
that for any $\hat\beta \in N_\alpha$, i.e. $\angle(\hat\beta, \beta) \le \alpha$, we have

(B.10) $\angle(\hat\beta, \hat\beta_-) \ge \frac{1}{3}\,\omega(\alpha)$

with probability converging to one as $n \to \infty$.


Proof. The proof is presented in Lin et al. [2015]. 

Note that $S_+$ and $S_-$ have the same distribution (viewed as functions of the
random terms $E$ and $\theta$):

$$S_-(E, \theta) = S_+(-E, \theta).$$

Let Aa denote the event |z (^f3,(3^ < a| U |z < a|, then 

E[Z (3, P) ] > E[Z (3, P) , ] + E[Z (3, P) , 

> E[Z (3,/3) ,Zl=] + ^EiZ (3,3_) ,Zl„] 

r w(a), 

> mm{a,-} > 0. 

6 

□ 


C. Proof of Lemma 5. We need the following lemmas. 

Lemma 6. Recall that $u = \frac{1}{H\sqrt{c}}\Sigma^{1/2} E\theta$ is defined as in (B.8). Then there
exist positive constants $C_1$ and $C_2$ such that

$$0 < C_1 \le \|u\|_2 \le C_2$$

with probability converging to one as $n \to \infty$.


Lemma 7. Assuming conditions in Theorem 3, let B and N^, be defined 
as in (B.8) and{B.9) respectively where 0 < a < arctan{A^ . 

i) There exists a positive constant $C_1$ such that for any $x \in N_\alpha$, we have
$\|Bx\| \ge C_1$ with probability converging to one as $n \to \infty$;

ii) For any $x \in N_\alpha$, we have $\cos\angle(x, Bx) \le 4\alpha$ with probability converging
to one as $n \to \infty$.








The following lemma is borrowed from Johnstone and Lu [2004].

Lemma 8. Let $\xi$ be a principal eigenvector of a non-zero symmetric
matrix $M$. For any $\eta \neq 0$,

$$\angle(\eta, M\eta) \le 3\,\angle(\eta, \xi).$$

The proof of Lemma 5 is made plausible by reference to Figure C1.


Fig C1: An illustrative graph


Since 

(C.l) 

(C.2) 


sin (^Z 


= sin (cji + 0 J 2 ) = sin (vr — cji — 0 J 2 ) 
> min < sin (wi), sin ((Ua) >, 


we only need to prove that there exists a positive small constant uj{a) ( < ^ 
) such that sin (wi), sin (a; 2 ) > sin(u;(a)) . In fact, if such uj[a) exists, we 
may choose M = S'-,^ = /3_ in Lemma 8 and get 


(3,3- 


> (3,5'-3) > ^w(a). 


Forajs- From Lemma 7 ii), cos Z(/3, il/3) < 4a, we know that there exists 
positive constants 5{a){< 5) such that sinwa > sin {6(a)) . 


For cui. Applying the law of sines to the triangle A(0, Oi, O 2 ) , we have 
sina;i ^ sinu ;3 [ smZ (.^3,3) \ 

Ilii3ll “ 11^311 \ 11^311 ) ' 


Note that from Lemma 7 i), there exists a constant Ci > 0 such that 
\\BP\\ > Cl and 

\\dM < \\D\\<h\ef + \\-EE-\\, 

H n 

is bounded by an absolute constant C given limn^oo ^ = P ^ sliced 
stable condition. Then (C.3) implies 


||S/3||sinZ(5/3,3) ^ Cisin(5(a) . , 

smwi = -- > smw >0 


11^/311 


C 
where w' ( < ^ ) is a small angle such that the last inequality holds. In
particular, we have wi > a;'. Hence 

Z(3,5_3) > 0 ;'a (5(a) = a;(a) 


□ 


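C.1. Proof of Lemma 6. In fact, let $T$ be an orthogonal matrix
such that $T\beta = (1, 0, \dots, 0)^\tau$ and $M = T\beta\beta^\tau T^\tau$, then

$$cH^2 u^\tau u = \theta^\tau E^\tau \Sigma E\theta = \theta^\tau E^\tau E\theta - \theta^\tau E^\tau \beta\beta^\tau E\theta$$
$$= \theta^\tau E^\tau T^\tau T E\theta - \theta^\tau E^\tau T^\tau (T\beta)\beta^\tau T^\tau T E\theta \overset{d}{=} \theta^\tau E^\tau E\theta - \theta^\tau E^\tau M E\theta,$$

where $\overset{d}{=}$ means equal in distribution. Note that $E^\tau E$ is a full rank $H \times H$
matrix; combining with Lemma 14, we know that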


^ 1(1 


,1 


1 


r < Xmini-E^E) < \^ax{-E^E) < C2(l + 


P 


P 



for some positive constants $C_1$ and $C_2$ with probability at least $1 - 2\exp(-p/8)$.
Note that $\lim p/n = \rho > 0$ as $n \to \infty$ and $n = Hc$; we know there exist positive
constants $C_1$ and $C_2$ such that

$$C_1 \frac{1}{H}\|\theta\|^2 \le \|u\|^2 \le C_2 \frac{1}{H}\|\theta\|^2$$

with probability at least $1 - 2\exp(-p/8)$.

On the other hand, the sliced stable condition implies that $\lim \frac{1}{H}\|\theta\|^2$
exists ($\neq 0$), so $\|u\|$ is bounded away from 0 and $\infty$ with probability 1 as
$n \to \infty$. □


C.2. Proof of Lemma 7. For i), for any $x \in N_\alpha$, let

$x = \cos(\delta)\beta + \sin(\delta)\eta$ where $\eta \perp \beta$, $\|\eta\| = 1$, $\delta \le \alpha$.
Since $Bx = \cos(\delta)u + (u^\tau\eta)\sin(\delta)\beta$, we have:
for some positive constant $C_1$.

For ii), since

$$x^\tau B x = 2(u^\tau\eta)\cos(\delta)\sin(\delta),$$
we have that uniformly over $N_\alpha$,

$$|x^\tau B x| \le |u^\tau\eta|\sin(2\delta),$$

which in turn implies:

cos{Z{Bx, x)) 


x'^Bxl sin(2(5)|ti'^r7| 

< . ^ ' < 4<5 < 4a. 


licll \\Bx\ 


— 1 


2 COS 


u 


□ 


D. Appendix B: Proofs of Theorems 4 to 6. 


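D.1. Proof of Theorem 4. Let $x(k) = \langle x, \beta_k\rangle$ where $\beta_k = (0, \dots, 1, \dots, 0) \in \mathbb{R}^p$
with the only 1 at the $k$-th position. Recall that

$\mathcal{T} = \{\, k \mid E[x(k) \mid y] \text{ is not constant} \,\}$,

$\mathcal{I}_p(t) = \{\, k \mid \mathrm{var}_H(x(k)) > t \,\}$,

$\mathcal{E}_p(t) = \{\, k \mid \mathrm{var}_H(x(k)) \le t \,\}$,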

and |T| < Cs for some positive constant C. Since var{K[x{k)\y\) > we 
may choose t = -^ for sufficiently small positive constant a such that for 
any k ^ t < ^var{'E[x{k)\y]. According to Lemma 1 and the Bonferroni’s 
inequality, we have 

P(r" C £p{t)) > 1 - ^ P( varnixik)) > t) 


> 1 - Cl exp ( -C'2-^ + Cs log(iL) + log(p - s) ) . 


and 


p(rc4{t))>p n{ varH{x{k)) > -var{m{k,y)) | 

Vfcer 

P ivarn {x{k)) < l-var {m{k, y)) 


> 1 - 


fcer 


> 1 _ c. exp I + C 3 log(i^) + log (Ce 


i.e., we have (11) and (12) hold. 


□ 
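
The screening sets used in the proof above can be illustrated with a minimal sketch. It assumes that $\mathrm{var}_H(x(k))$ is the average over slices of the squared slice mean of the $k$-th centered coordinate, with slices formed by sorting the sample on $y$. The function names `sliced_variances` and `dt_screen`, the toy model and the threshold value are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def sliced_variances(x, y, H):
    """Return the assumed var_H(x(k)) for every coordinate k.

    x: (n, p) centered predictors, y: (n,) responses, H: number of slices.
    """
    n, p = x.shape
    c = n // H                              # samples per slice; any remainder is dropped
    order = np.argsort(y)[: H * c]          # sort observations by the response
    slices = x[order].reshape(H, c, p)      # slice h holds rows h*c ... (h+1)*c - 1
    slice_means = slices.mean(axis=1)       # shape (H, p)
    return (slice_means ** 2).mean(axis=0)  # shape (p,)

def dt_screen(x, y, H, t):
    """Indices k whose sliced variance exceeds the threshold t."""
    x = x - x.mean(axis=0)                  # center each coordinate
    return np.flatnonzero(sliced_variances(x, y, H) > t)

# Toy example (hypothetical): only coordinates 0 and 1 carry signal.
rng = np.random.default_rng(0)
n, p, H = 2000, 200, 20
x = rng.standard_normal((n, p))
y = x[:, 0] + x[:, 1] + 0.5 * rng.standard_normal(n)
selected = dt_screen(x, y, H, t=0.2)
```

In this toy model the sliced variance of an inactive coordinate is of order $1/c$, while that of an active coordinate stays near $\mathrm{var}(E[x(k) \mid y])$, so a threshold between the two levels should retain the active set.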












Q. LIN, Z. ZHAO AND J. S. LIU 


D.2. Proof of Theorem 5 By choosing H,c and t = -^ properly, from 
Theorem 4, we have 

P(f = r) > 1 - Cl exp + Cs \og{H) + \og{p - s)) 

for some positive constants Cj, i = 1, 2,3. When T = T, we have |T| = ©(s). 
For the n samples {Yi,X^), apply Theorem 1, we have 


I /tCT, . ,, ^ n'r-r.r .rrii / 1 

) —-^plb < ||Ai^ ~-^p’ II2 — Op(-^ H -—h 


In particular, with probability converging to one, we have 



^7 ,7 

||e(Aj|^ ) — Ap ||2 —)• 0 as n —)• 00 . 

D.3. Proof of Theorem 6. The proof is almost identical to the proof of
Theorem 2, except that we additionally need to use Theorem 1 in Bickel
and Levina [2008].

E. Appendix C 
E.l. Assisting Lemmas 


Definition 2. A set of random variables $x_1, \dots, x_n$ can be treated as i.i.d.
random samples from a random variable $x$ if, for any $n$-variate symmetric
function $f(w_1, \dots, w_n)$, $f(x_1, \dots, x_n)$ is identically distributed as $f(z_1, \dots, z_n)$,
where $z_1, \dots, z_n$ are i.i.d. random samples from $x$.


Lemma 9. Let $(x_i, y_i)$ be $n$ i.i.d. random samples from a joint distribution
$(x, y)$. Sort these samples according to the order statistics of the $y_i$'s and denote
the sorted samples by $(x_{(1)}, y_{(1)}), (x_{(2)}, y_{(2)}), \dots, (x_{(n)}, y_{(n)})$. Then for any $a$,
$b$ ($1 \le a < b + 1 \le n$), $x_{(a+1)}, \dots, x_{(b)}$ can be treated as $b - a$ i.i.d. samples
from $x \mid (y \in [y_{(a)}, y_{(b+1)}])$.

Proof. In fact, we only need to prove that $y_{(a+1)}, \dots, y_{(b)}$ can be treated
as $b - a$ i.i.d. samples of $y \mid (y \in [y_{(a)}, y_{(b+1)}])$. The latter only needs to be
proved for the uniform distribution, which can be verified directly.


Corollary 1. In the sliced inverse regression context, recall that $S_h$
denotes the $h$-th interval $(y_{h-1,c}, y_{h,c}]$ for $2 \le h \le H - 1$, and $S_1 = (-\infty, y_{1,c}]$,
$S_H = (y_{H-1,c}, \infty)$. We have that $x_{h,i}$, $i = 1, \dots, c - 1$, can be treated as $c - 1$
random samples of $x \mid (y \in S_h)$ for $h = 1, \dots, H - 1$, and $x_{H,1}, \dots, x_{H,c}$ can be
treated as $c$ random samples of $x \mid (y \in S_H)$.


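Lemma 10. Suppose that $(x, y)$ are defined over a $\sigma$-finite space $\mathcal{X} \times \mathcal{Y}$
and $g$ is a non-negative function such that $E[g(x)]$ exists. For any fixed positive
constants $C_1 < 1 < C_2$, there exists a constant $C$ which only depends
on $C_1, C_2$ such that for any partition $\mathbb{R} = \cup_{h=1}^{H} S_h$ where the $S_h$ are intervals
satisfying

(E.1) $\frac{C_1}{H} \le P(y \in S_h) \le \frac{C_2}{H}, \quad \forall h,$

we have

$$\sup_h E\big(g(x) \mid y \in S_h\big) \le C H\, E[g(x)].$$


Proof. According to Fubini's Theorem, for any $h$,

$$E[g(x)] = \sum_{k=1}^{H} P(y \in S_k)\int_{\mathcal{X}} g(x)\,p(x \mid y \in S_k)\,dx \ge P(y \in S_h)\int_{\mathcal{X}} g(x)\,p(x \mid y \in S_h)\,dx.$$

Due to the condition (E.1), there exists a positive constant $C$ such that

$$\int_{\mathcal{X}} g(x)\,p(x \mid y \in S_h)\,dx \le C H\, E[g(x)].$$

□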


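Corollary 2. Let $x$ be a multivariate random variable with covariance
matrix $\Sigma$. For any partition satisfying (E.1), there exists a constant $C$ such
that

$$\mathrm{var}\big(\beta^\tau x \mid y \in S_h\big) \le C H\,\mathrm{var}(\beta^\tau x), \quad \text{for any unit vector } \beta,$$

and

$$\lambda_{\max}\big(\mathrm{var}(x \mid y \in S_h)\big) \le C H\,\lambda_{\max}\big(\mathrm{var}(x)\big).$$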

Corollary 3. Let x be a sub-Gaussian random variable which is upper- 
exponentially bounded by K. Then for any partition satisfying (E.l), there 
exists a constant C such that 


J 


X' 


2 \ 


y€S'h]< CFE[exp 
Recall the definition of the random intervals $S_h$, $h = 1, 2, \dots, H$, and the random
variable $\delta_h = \delta_h(\omega) = \int_{y \in S_h(\omega)} f(y)\,dy$.


Lemma 11. Define the event E{e) = | w \^h — ^\ > e,V/i |. 
exists a positive constant C such that, for any e > we have 


There 


^ 

(E.2) P{E{e)) < CH^VHc + 1 exp(-(R'c + 1) —) 

O.Z 

for sufficient large H and c. 


Proof. The proof is deferred to the end of this paper. 
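
The concentration of the slice probabilities $\delta_h$ around $1/H$ can be checked numerically with a short sketch. It assumes $y$ is uniform on $[0, 1]$ (so the probability mass of an interval equals its length) and that the slice boundaries are the order statistics $y_{(c)}, y_{(2c)}, \dots, y_{((H-1)c)}$; the values of `H`, `c` and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
H, c = 10, 1000                      # hypothetical number of slices and slice size
n = H * c

y = np.sort(rng.uniform(size=n))     # order statistics y_(1) <= ... <= y_(n)

# Interior cut points y_(c), y_(2c), ..., y_((H-1)c); the first slice starts at 0
# and the last slice ends at 1 because y is supported on [0, 1].
cuts = np.concatenate(([0.0], y[c - 1 : n - 1 : c], [1.0]))

delta = np.diff(cuts)                # delta_h = P(y in S_h) for uniform y
max_dev = np.abs(delta - 1.0 / H).max()
```

Here `max_dev` is the quantity $\max_h |\delta_h - 1/H|$ whose tail the lemma controls; for the sizes above it should be a small fraction of $1/H$.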


E.2. Some Results from Random Matrices Theory. We collect some direct
corollaries of the non-asymptotic random matrices theory (e.g., Rudelson
and Vershynin [2013]).

Lemma 12. Let $M$ be any $p \times n$ matrix ($n > p$) whose columns $M_i$ are
independent sub-Gaussian random vectors in $\mathbb{R}^p$ with second moment $I_p$, and let
$\lambda_{sing,\min}(M)$, $\lambda_{sing,\max}(M)$ be the minimal non-zero and maximal singular
values of $M$. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-C' t^2)$, we
have:


$$\sqrt{n} - C\sqrt{p} - t \le \lambda_{sing,\min}(M) \le \lambda_{sing,\max}(M) \le \sqrt{n} + C\sqrt{p} + t.$$

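Lemma 13. Let $x_1, \dots, x_n$ be $n$ i.i.d. samples from a $p$-dimensional
sub-Gaussian random variable with covariance matrix $\Sigma_x$ and let $\rho = p/n$. Suppose there
exist positive constants $C_1$ and $C_2$ such that


$$C_1 \le \lambda_{\min}(\Sigma_x) \le \lambda_{\max}(\Sigma_x) \le C_2.$$


Let $\hat\Sigma_x = \frac{1}{n}\sum_i x_i x_i^\tau$. Then


$$\|\hat\Sigma_x - \Sigma_x\|_2 \to 0 \quad \text{if } \rho \to 0 \text{ when } n \to \infty.$$

It is also easy to see that, given the boundedness condition on $\Sigma_x$,
$\|\hat\Sigma_x^{-1} - \Sigma_x^{-1}\|_2 \to 0$ if $\rho \to 0$ when $n \to \infty$.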

Proof. Let $x_i = \Sigma_x^{1/2} m_i$ where $m_i$ is a sub-Gaussian random variable
with covariance matrix $I_p$ and $M = (m_1, \dots, m_n)$. From Lemma 12, we
have


1 


-MM^ - I 


n 


■pl|2 


and 


IS ^ - S"^||2 = - Ip||2||S-i/2| 


n 


with probability converges to 1 as re —>■ oo because 


Xmax < (l + 


and 


with probability at least 1 — 2exp(—C"p). 


□ 


The following lemma is well known in the non-asymptotic random matrix
theory (Vershynin [2010], Proposition 5.34) and is slightly different from
Lemma 12.


Lemma 14. Let $E_{p \times H}$ be a $p \times H$ matrix whose entries are independent
standard normal random variables. Then for every $t \ge 0$, with probability at
least $1 - 2\exp(-t^2/2)$, we have:

$$\lambda_{sing,\min}(E_{p \times H}) \ge \sqrt{p} - \sqrt{H} - t$$

and

$$\lambda_{sing,\max}(E_{p \times H}) \le \sqrt{p} + \sqrt{H} + t.$$


Corollary 4. We have 

2 yVP ~ V^j ^ {^Pxh) < Xsing,max {Epxu) < “ (^\/P + vHj . 

with probability converging to one, as re —oo. 


Proof. Choosing t = y/pl‘2, according to Lemma 14, we have: 

jp / Xraax {Eh ) ^ sA > p ^ Xmax {E}j ) ^ _|_ 'JP ^ 

+ + 2^ + 2Vh) 
and

]p f ^ l\ > p ^ ^maxiEn) ^ ^_^ 

[^-Vh- 2^-2VHJ 

with probability at least 1 — 2exp(—p/8), i.e., With probability converging 
to one, we have 

~ — ^min(^pxH) ^ ^max{EpxH) ^ IjiVP '/H). 

□ 
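
The singular value bounds for a Gaussian matrix used in this proof can be checked on a single draw. The sketch below is only an illustration under assumed sizes: `p`, `H` and the seed are hypothetical, and the deviation parameter is set to $t = \sqrt{p}/2$ as in the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
p, H = 400, 20                               # hypothetical dimensions with p much larger than H

E = rng.standard_normal((p, H))              # p x H matrix with i.i.d. N(0, 1) entries
s = np.linalg.svd(E, compute_uv=False)       # singular values of E

t = np.sqrt(p) / 2
lower = np.sqrt(p) - np.sqrt(H) - t          # lower bound on the smallest singular value
upper = np.sqrt(p) + np.sqrt(H) + t          # upper bound on the largest singular value

within_bounds = bool(lower <= s.min() and s.max() <= upper)
```

With this choice of $t$ the bounds fail with probability at most $2\exp(-p/8)$, which is negligible for the sizes above, so `within_bounds` is expected to be true.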

E.3. Basic Properties of sub-Gaussian random variables. We rephrase
several equivalent definitions of the sub-Gaussian distribution here (see e.g.,
Vershynin [2010]):

Definition 3. Let $x$ be a random variable. Then the following properties
are equivalent, with parameters $K_i$ differing from each other by at most an
absolute constant factor.

1. Tails: $P(|x| > t) \le \exp(1 - t^2/K_1^2)$ for all $t \ge 0$.

2. Moments: $(E[|x|^p])^{1/p} \le K_2\sqrt{p}$ for all $p \ge 1$.

3. Super-exponential moment: $E\exp(x^2/K_3^2) \le e$.

Moreover, if $E[x] = 0$, then the properties 1-3 are also equivalent to the
following one:

4. Moment generating function: $E[\exp(tx)] \le \exp(t^2 K_4^2)$ for all $t \in \mathbb{R}$.

Definition 4. For a sub-Gaussian random variable $x$ with the constants
$K_i$, $i = 1, 2, 3, 4$, given in Definition 3, we will call a constant $K$ an upper-exponential
bound of $x$, or say that $x$ is upper-exponentially bounded by $K$, if
$K \ge \max\{K_1, K_2, K_3, K_4\}$.


We summarize some properties regarding the sub-Gaussian distributions
in the following lemmas.

Lemma 15. Let $\delta_1, \dots, \delta_n$ be $n$ (not necessarily independent or with mean
zero) sub-Gaussian random variables upper-exponentially bounded by $K$.

i) $\frac{1}{n}\sum_i \delta_i$ is sub-Gaussian and upper-exponentially bounded by $K$.

ii) $\delta_i - E[\delta_i]$ is sub-Gaussian and upper-exponentially bounded by $2K$.
iii) If they are independent and with mean zero, then $\frac{1}{\sqrt{n}}\sum_i \delta_i$ is
sub-Gaussian and upper-exponentially bounded by $K$.

iv) If they are i.i.d., then we have the concentration inequality:

$$P\Big(\Big|\frac{1}{n}\sum_i \delta_i - E[\delta_i]\Big| > t\Big) \le 2\exp\Big(\frac{-n t^2}{2K^2 e + 2tK}\Big).$$


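Proof. i) follows from the linearity of expectation and the
convexity of the exponential function, i.e.,

$$E\Big[\exp\Big(\frac{(\frac{1}{n}\sum_i \delta_i)^2}{K^2}\Big)\Big] \le \max_i E\Big[\exp\Big(\frac{\delta_i^2}{K^2}\Big)\Big] \le e.$$

ii) From Definition 3, we know that $|E[\delta_i]| \le K$, which gives us the desired
upper-exponential bound of $\delta_i - E[\delta_i]$.

iii) is trivial as $\delta_1, \dots, \delta_n$ are independent and with mean zero.

iv) Since $\delta_i$ is sub-Gaussian and upper-exponentially bounded by $K$, we have:

$$E[|\delta_i|^p] = \int_0^\infty p t^{p-1} P(|\delta_i| > t)\,dt \le \int_0^\infty p t^{p-1}\exp\Big(1 - \frac{t^2}{K^2}\Big)dt = \frac{ep}{2}\Gamma\Big(\frac{p}{2}\Big)K^p \quad \text{for any } p \ge 1,$$
$$\le p!\,K^{p-2}(K^2 e)/2 \quad \text{for any } p \ge 2.$$

Recall the well known Bernstein inequality.


Lemma 16 (Bernstein Inequality). If there exist positive constants
$V$ and $b$ such that for any integer $m \ge 2$,

$$E[|\delta_i|^m] \le m!\,b^{m-2}V/2,$$

then

(E.3) $P\Big(\Big|\frac{1}{n}\sum_i \delta_i - E[\delta_i]\Big| > t\Big) \le 2\exp\Big(\frac{-n t^2}{2(V + bt)}\Big).$

By choosing $V = K^2 e$ and $b = K$, we get the desired concentration
inequality. □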


Lemma 17. Suppose that $(x, y)$ are defined over a $\sigma$-finite space $\mathcal{X} \times \mathcal{Y}$
and $x$ is sub-Gaussian with mean 0 and upper-exponentially bounded by $K$.
Let $m(y) = E[x \mid y]$ and $\epsilon(y) = x - m(y)$; then we have

i) $m(y)$ and $\epsilon(y)$ are sub-Gaussian and upper-exponentially bounded by
$K$ and $2K$ respectively.

ii) Let Z consists of points y such that x\y is not sub-Gaussian, i.e., 

then F{y G Z) = 0. 

iii) For any fixed positive constants Ci < 1 < C 2 and any partition M = 
U^i Sh where Sh are intervals satisfying 

^<ny^Sh)<^,'^h, 

there exists a constant G such that 

supP(x|j^g 5 ^ > t) < CHexp 
h 



3te 


(0,to] such that / exp{tx‘^)p{x\y)p{y)dx = oo\, 
J X 


As a direct corollary, we know that there exists a positive constant G 
such that 


E 


exp 


V 2^2 J 


< CH, 


and 


E 


{x\y&Sh 


771 

< CHmK'^T{^)f2. 


vi) Suppose that x\y^Sh *-5 defined as in iii). Let Xi, i=l,...,c be c samples 
from x\yeSh, Xh = ^ and yn = E[x|y 65 ^], we have 

P[|x,-wl>t]<2exp( ^^^g*+2^^. ). 

Proof. i) By Jensen's inequality, we have

$$E[\exp(tE[x \mid y])] \le E[E[\exp(tx) \mid y]] = E[\exp(tx)] \le \exp(t^2 K^2),$$

i.e., $m(y)$ is sub-Gaussian and upper-exponentially bounded by $K$. Since $x$
and $m(y)$ are sub-Gaussian and upper-exponentially bounded by $K$, we know
that $\epsilon = x - m(y)$ is sub-Gaussian and upper-exponentially bounded by
$2K$.
ii) Let $p(x, y)$ be the joint density function of $(x, y)$ and $p(x)$, $p(y)$ be the
marginal densities of $x$, $y$. Since $x$ is sub-Gaussian, we know there exists
$t_0 > 0$ such that

$$\int_{\mathcal{X}}\int_{\mathcal{Y}} \exp(tx^2)\,p(x \mid y)\,p(y)\,dy\,dx \le e \quad \text{for } 0 < t \le t_0.$$

By the Fubini Theorem, we know

(E.4) $\int_{\mathcal{Y}} p(y)\int_{\mathcal{X}} \exp(tx^2)\,p(x \mid y)\,dx\,dy \le e \quad \text{for } 0 < t \le t_0.$

Recall that we have $Z = \{\, y \mid \exists t \in (0, t_0] \text{ such that } \int_{\mathcal{X}} \exp(tx^2)\,p(x \mid y)\,p(y)\,dx = \infty \,\}$;
from equation (E.4), we know $P(y \in Z) = 0$. In particular, we know that
for any $y \notin Z$, $x \mid y$ is sub-Gaussian. However, the norm (e.g., sub-exponential
norm) of $x \mid y$ might vary with $y$ and, as a function of $y$, it might
not be bounded.


iii) Erom Lemma 10, we know that there exists a positive constant G such 
that 


/ exp{tx'^)p{x\y G Sh)dx < CHe. 
Jx 


For simplicity if notation, we will denote x\y£Sh by -z through out this lemma. 
So for 0 < t < to = we have 

> „) < < CHe^p(-tV) . 

exp(t^a^) ^ ^ 

From the above tail bounds, we have that for any integer m > 0 

/ OO /•OO J.2 

P(| 2 ;| > dt < CHm / exp{—dt 

<CHmK^{ml2)l2. 


We then have 


E [exp {tz"^)] < 


E E [t 


< 


m=0 

oo 

E 

m=0 


ml 


fm ^ 2 m] 

sE 

m=0 

ml 


m^2m'\ 


ml 


CH Y 


2m 


m=0 


From which we know if 0 < t < the R.H.S is bounded by CH for a 

positive constant G. 
vi) From the previous proof, we know that for any integer $m \ge 2$


Ed^r] < CHmlK^ = m\K^-^{2CHK^)/2. 
By the Bernstein inequality (E.3), we have: 


_ E[z] > < 2exp 



□ 


Lemma 18. Let Zi,i = I, ■ ■ ■ ,n be i.i.d. samples of a sub-Gaussian dis¬ 
tribution exponentially upper bounded by K, then there exist positive con¬ 
stants Cl, C 2 such that, if ^Jne oo, we have 



where z = 

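Proof. Recall the following Hanson-Wright inequality in Rudelson and
Vershynin [2013].

Lemma 19. Let $v = (x(1), \dots, x(n))$ be a sub-Gaussian random vector
with independent components $x(i)$ such that $E[x(i)] = 0$ and $\|x(i)\|_{\psi_2} \le K$.
Let $A$ be an $n \times n$ matrix. Then there exists a positive constant $C$ such
that for any $t \ge 0$,

$$P\big(|v^\tau A v - E[v^\tau A v]| > t\big) \le 2\exp\Big(-C\min\Big(\frac{t^2}{K^4\|A\|_{HS}^2}, \frac{t}{K^2\|A\|}\Big)\Big).$$

Here the $\psi_2$ norm of a random variable $z$ is defined as $\|z\|_{\psi_2} = \sup_{p \ge 1} p^{-1/2}(E|z|^p)^{1/p}$
and the HS norm of a matrix $A$ is defined as $\|A\|_{HS} = (\sum_{i,j}|a_{ij}|^2)^{1/2}$.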


Since 
and $z_i - E[z]$ are sub-Gaussian with mean 0, from Lemma 19 by choosing
A = ilp and = {z\ — £[ 0 ],Zp — ^[z]), we have 

(E.5) p|^|i^(zi-E[z])2-E[(z-E[z])2]| >e^ <2exp(-C^), 

since ^/ne —)> 00 . 

The following follows from the usual deviation argument: 

P ((E[ 2 ;] — z)"^ > e) <Ci exp {—C 2 ne). 

Combining with the estimate (E.5), we know there exists positive con¬ 
stants Cl and C 2 such that 

P ^1^ - var{z)\ > - C'lexp , 

for sufficiently large n since y/nt —)• 00 . □ 

E.4. Proof of Lemma 11. We only need to prove this lemma for $n$ i.i.d. samples $y_i$ from a uniform distribution over $[0,1]$. We slightly change the notation of order statistics $y_{(i)}$ to $y_{(i,n)}$ so that we can keep track of the sample size. Since $y$ is uniformly distributed on $[0,1]$, it is well known that $y_{(i,n)} \sim \mathrm{Beta}(i, n-i+1)$ with expectation $\frac{i}{n+1}$ and mode $\frac{i-1}{n-1}$. Lemma 11 is a direct corollary of the following lemma.
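The Beta law of uniform order statistics can be checked by simulation. The sketch below estimates the mean of the $i$-th order statistic of $n$ uniform draws and compares it with the Beta$(i, n-i+1)$ mean $i/(n+1)$. The function names, the seed and the number of trials are arbitrary choices made for illustration.

```python
import random


def simulate_order_statistic_mean(i, n, trials=20_000, seed=0):
    """Monte Carlo estimate of E[y_(i,n)] for n i.i.d. Uniform[0,1] draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        total += sample[i - 1]  # i-th smallest value (1-indexed)
    return total / trials


def beta_mean(i, n):
    """Mean of Beta(i, n - i + 1)."""
    return i / (n + 1)


def beta_mode(i, n):
    """Mode of Beta(i, n - i + 1), valid for 1 < i < n."""
    return (i - 1) / (n - 1)
```

With `i = 3` and `n = 10`, the simulated mean should be close to `beta_mean(3, 10)`, up to Monte Carlo error.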

Lemma 20. Suppose there are $n = Hc$ i.i.d. samples from the uniform distribution over $[0,1]$. When $H$, $c$ are sufficiently large, we have the following large deviation inequalities of $y_{(kc,Hc)}$, $k = 1, \cdots, H-1$.

i) There exists a positive constant C, such that for any e > we 

have 

$$\mathbb{P}\left(y_{(kc,Hc)} > \frac{k}{H} + \epsilon\right) \le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{2}\right);$$

ii) There exists a positive constant C, such that for any e > TjJrx; 
have 

$$\mathbb{P}\left(y_{(kc,Hc)} < \frac{k}{H} - \epsilon\right) \le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{8}\right);$$
iii) Let $\delta(k,H,c) = |y_{((k-1)c,Hc)} - y_{(kc,Hc)}|$ for $2 \le k \le H-1$, $\delta(1,H,c) = |y_{(c,Hc)}|$ and $\delta(H,H,c) = |1 - y_{((H-1)c,Hc)}|$. There exists a positive

constant C, such that for any e > , we have for any 1 <k < H: 


$$\mathbb{P}\left(\left|\delta(k,H,c) - \frac{1}{H}\right| > \epsilon\right) \le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{32}\right).$$
We will prove Lemma 20 later. Assuming it, we have

$$\mathbb{P}(E(\epsilon)) \le \sum_{k=1}^{H} \mathbb{P}\left(\left|\delta(k,H,c) - \frac{1}{H}\right| > \epsilon\right) \le CH^2\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{32}\right).$$

□


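E.4.1. Proof of Lemma 20. The first part: For any $1 \le k \le H-1$, we note that

$$\mathbb{P}\left(y_{(kc,Hc)} > \frac{k}{H} + \epsilon\right) \le \mathbb{P}\left(y_{(kc,Hc)} > \frac{kc}{Hc+1} + \epsilon\right) = \frac{1}{B(kc, Hc-kc+1)} \int_{\frac{kc}{Hc+1}+\epsilon}^{1} x^{kc-1}(1-x)^{Hc-kc}\,dx.$$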


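When $\epsilon$ satisfies the lower bound required in i), the mode satisfies $x_m = \frac{kc-1}{Hc-1} < \frac{kc}{Hc+1} + \epsilon \triangleq x_d$, so the integrand is decreasing on $[x_d, 1]$ and we have

$$\mathbb{P}\left(y_{(kc,Hc)} > \frac{k}{H} + \epsilon\right) \le \frac{x_d^{kc-1}(1-x_d)^{Hc-kc+1}}{B(kc, Hc-kc+1)} \le H\,\frac{x_d^{kc}(1-x_d)^{Hc-kc+1}}{B(kc, Hc-kc+1)}. \tag{E.6}$$

The last inequality is due to $Hx_d > 1$. If $\epsilon + \frac{k}{H} \ge 1$, then $\mathbb{P}(y_{(kc,Hc)} > \frac{k}{H} + \epsilon) = 0$ and Lemma 20 holds automatically.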

We may assume that $\epsilon + \frac{k}{H} < 1$ below. Let us denote the right hand side of (E.6) by $A$, then

$$\begin{aligned}
\log(A) ={}& \log(H) + kc\log(E + \epsilon) + (Hc-kc+1)\log(1 - E - \epsilon) \\
&+ \log(Hc+1)! - \log(kc)! - \log(Hc-kc+1)! \\
&- \log(Hc+1) + \log(kc) + \log(Hc-kc+1),
\end{aligned}$$

where $E = \frac{kc}{Hc+1}$. According to the Stirling formula:

$$\log(n!) = n\log(n) - n + \frac{1}{2}\log(2\pi n) + O\left(\frac{1}{n}\right),$$

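when $c$ is sufficiently large we have

$$\begin{aligned}
\log(A) ={}& \log(H) + (Hc+1)\left(E\log\left(1 + \frac{\epsilon}{E}\right) + (1-E)\log\left(1 - \frac{\epsilon}{1-E}\right)\right) \\
&- \frac{1}{2}\left(\log(Hc+1) - \log(kc) - \log(Hc-kc+1)\right) - \frac{1}{2}\log(2\pi) + O\left(\frac{1}{c}\right) \\
\le{}& \log(H) - (Hc+1)\frac{\epsilon^2}{2(1-E)} - \frac{1}{2}\log(2\pi) \\
&- \frac{1}{2}\left(\log(Hc+1) - \log(kc) - \log(Hc-kc+1)\right) + O\left(\frac{1}{c}\right) \\
\le{}& \log(H) - (Hc+1)\frac{\epsilon^2}{2(1-E)} - \frac{1}{2}\left(\log(Hc+1) - \log(kc) - \log(Hc-kc+1)\right),
\end{aligned} \tag{E.7}$$

where we use the fact that $0 < E + \epsilon < 1$ and the following elementary lemma, which can be proved by the Taylor expansion.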

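Lemma 21. Suppose $a$, $b$ are positive numbers such that $a + b = 1$, then for any $0 < \epsilon < b$, we have:

$$a\log\left(1 + \frac{\epsilon}{a}\right) + b\log\left(1 - \frac{\epsilon}{b}\right) \le -\frac{\epsilon^2}{2b}.$$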
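An inequality of this type can be sanity-checked on a grid before it is used. The sketch below evaluates $a\log(1+\epsilon/a) + b\log(1-\epsilon/b)$ against the candidate bound $-\epsilon^2/(2b)$, which is the right-hand side suggested by the factor $2(1-E)$ appearing in the exponent later in the proof; that reading, the grid size and the function names are assumptions of this illustration.

```python
import math


def lemma21_lhs(a, eps):
    """a*log(1 + eps/a) + b*log(1 - eps/b) with b = 1 - a and 0 < eps < b."""
    b = 1.0 - a
    return a * math.log(1 + eps / a) + b * math.log(1 - eps / b)


def lemma21_rhs(a, eps):
    """Candidate upper bound -eps^2 / (2b) with b = 1 - a."""
    b = 1.0 - a
    return -(eps ** 2) / (2 * b)


def max_violation(grid=50):
    """Largest value of lhs - rhs over a grid of (a, eps); <= 0 means no violation."""
    worst = -math.inf
    for i in range(1, grid):
        a = i / grid
        b = 1.0 - a
        for j in range(1, grid):
            eps = b * j / grid  # keeps 0 < eps < b
            worst = max(worst, lemma21_lhs(a, eps) - lemma21_rhs(a, eps))
    return worst
```

The bound follows from $\log(1+x) \le x$ and $\log(1-x) \le -x - x^2/2$ for $0 \le x < 1$, so `max_violation()` is expected to be non-positive.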


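Now we know that there exists a positive constant $C$ such that for any $1 \le k \le H-1$ and for any $\epsilon$ satisfying the lower bound required in i), the following holds:

$$\begin{aligned}
\mathbb{P}\left(y_{(kc,Hc)} > \frac{k}{H} + \epsilon\right)
&\le CH\sqrt{\frac{(kc)(Hc-kc+1)}{Hc+1}}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{2(1-E)}\right) \\
&\le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{2(1-E)}\right) \\
&\le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{2}\right).
\end{aligned}$$

The last inequality follows from $\frac{\epsilon^2}{2(1-E)} > \frac{\epsilon^2}{2}$ since $0 < E < 1$.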
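The upper-tail bound can be compared with the exact tail of the order statistic, which is available through the binomial distribution: $y_{(j,n)} > x$ exactly when fewer than $j$ of the $n$ draws fall in $[0, x]$. The sketch below is an illustration under assumptions: it takes the bound in the form $CH\sqrt{Hc+1}\exp(-(Hc+1)\epsilon^2/2)$ and sets $C = 1$, whereas the lemma only asserts that some positive constant $C$ exists. The function names are hypothetical.

```python
import math


def order_stat_upper_tail(j, n, x):
    """Exact P(y_(j,n) > x) for n i.i.d. Uniform[0,1] draws.

    The event holds iff at most j - 1 draws are <= x, so the tail equals
    P(Binomial(n, x) <= j - 1).
    """
    if x >= 1.0:
        return 0.0
    if x <= 0.0:
        return 1.0
    return sum(
        math.comb(n, i) * x ** i * (1 - x) ** (n - i) for i in range(j)
    )


def lemma20_upper_bound(H, c, eps, C=1.0):
    """Assumed bound C * H * sqrt(Hc + 1) * exp(-(Hc + 1) * eps^2 / 2)."""
    n1 = H * c + 1
    return C * H * math.sqrt(n1) * math.exp(-n1 * eps ** 2 / 2)


def bound_holds(H, c, k, eps, C=1.0):
    """Check the exact tail of y_(kc,Hc) at k/H + eps against the assumed bound."""
    tail = order_stat_upper_tail(k * c, H * c, k / H + eps)
    return tail <= lemma20_upper_bound(H, c, eps, C)
```

For moderate sizes such as `H = 5`, `c = 20` the bound is loose, since the prefactor $H\sqrt{Hc+1}$ alone is much larger than 1 unless $\epsilon$ is well above $1/\sqrt{Hc}$.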
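The second part: The proof of the second part is similar. For completeness, we sketch some calculations below. For any $1 \le k \le H-1$, when $\epsilon$ satisfies the lower bound required in ii), we have

$$\mathbb{P}\left(y_{(kc,Hc)} < \frac{k}{H} - \epsilon\right) \le \mathbb{P}\left(y_{(kc,Hc)} < \frac{kc}{Hc+1} - \epsilon/2\right) = \frac{1}{B(kc, Hc-kc+1)} \int_{0}^{\frac{kc}{Hc+1}-\epsilon/2} x^{kc-1}(1-x)^{Hc-kc}\,dx.$$

For such $\epsilon$, we know the mode satisfies $x_m = \frac{kc-1}{Hc-1} > x_d' \triangleq \frac{kc}{Hc+1} - \epsilon/2$, so the integrand is increasing on $[0, x_d']$ and we have

$$\mathbb{P}\left(y_{(kc,Hc)} < \frac{k}{H} - \epsilon\right) \le \frac{(x_d')^{kc}(1-x_d')^{Hc-kc}}{B(kc, Hc-kc+1)} \le H\,\frac{(x_d')^{kc}(1-x_d')^{Hc-kc+1}}{B(kc, Hc-kc+1)}.$$

The last inequality is due to $H(1 - x_d') > 1$.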

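The rest is similar to the first part. We have that for any $1 \le k \le H-1$ and for any $\epsilon$ satisfying the lower bound required in ii),

$$\mathbb{P}\left(y_{(kc,Hc)} < \frac{k}{H} - \epsilon\right) \le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{8}\right). \tag{E.8}$$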


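The third part: The third part is a direct corollary of the first two parts. Note that for any $1 \le k \le H-2$ and for any $\epsilon$ satisfying the lower bound required in iii),

$$\begin{aligned}
\mathbb{P}\left(\left|\delta(k+1,H,c) - \frac{1}{H}\right| > \epsilon\right)
&= \mathbb{P}\left(\left|y_{((k+1)c,Hc)} - \frac{k+1}{H} - \left(y_{(kc,Hc)} - \frac{k}{H}\right)\right| > \epsilon\right) \\
&\le \mathbb{P}\left(\left|y_{((k+1)c,Hc)} - \frac{k+1}{H}\right| > \frac{\epsilon}{2}\right) + \mathbb{P}\left(\left|y_{(kc,Hc)} - \frac{k}{H}\right| > \frac{\epsilon}{2}\right) \\
&\le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{32}\right).
\end{aligned}$$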
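When $k = 1$, we have

$$\begin{aligned}
\mathbb{P}\left(\left|\delta(1,H,c) - \frac{1}{H}\right| > \epsilon\right)
&= \mathbb{P}\left(\left|y_{(c,Hc)} - \frac{1}{H}\right| > \epsilon\right) \\
&\le \mathbb{P}\left(y_{(c,Hc)} - \frac{1}{H} > \epsilon\right) + \mathbb{P}\left(y_{(c,Hc)} - \frac{1}{H} < -\epsilon\right) \\
&\le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{8}\right) \\
&\le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{32}\right).
\end{aligned}$$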


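When $k = H$, we have

$$\begin{aligned}
\mathbb{P}\left(\left|\delta(H,H,c) - \frac{1}{H}\right| > \epsilon\right)
&= \mathbb{P}\left(\left|y_{((H-1)c,Hc)} - \frac{H-1}{H}\right| > \epsilon\right) \\
&\le \mathbb{P}\left(y_{((H-1)c,Hc)} - \frac{H-1}{H} > \epsilon\right) + \mathbb{P}\left(y_{((H-1)c,Hc)} - \frac{H-1}{H} < -\epsilon\right) \\
&\le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{8}\right) \\
&\le CH\sqrt{Hc+1}\,\exp\left(-(Hc+1)\frac{\epsilon^2}{32}\right).
\end{aligned}$$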